Talk Abstracts
Wide Data vs. Big Data [28.23 MB PDF]
Alyssa Goodman
In life, more of the same isn't always better. Often, variety is more important. In Science, using more kinds of data, rather than just more data, often helps answer the hardest questions. In this talk, I will discuss tools for visualizing several data sets at once. In some cases, the visualizations are straightforward (such as overlaying layers of imagery or catalog data on images, in tools like WorldWide Telescope or Aladin), but in other cases, more subtle "linked-view" visualization amongst data sets and visualization types yields the deepest insight (e.g. using tools like Glue). The talk will include demonstrations of software using real data sets, both large and small, and will illustrate why data diversity is at least as important a concept as data volume.
Authors: Alyssa Goodman
How do you look at a billion data points? Exploratory Visualization for Big Data [1.16 MB PDF]
Carlos Scheidegger
Consider exploration of large multidimensional spatiotemporal datasets with billions of entries. Are certain attributes correlated spatially or temporally? How do we even look at data of this size? In this talk, I will present the techniques and algorithms to compute and query a nanocube, a data structure that enables interactive visualizations of data sources in the range of billions of elements.
Data cubes are widely used for exploratory data analysis. Although they are sometimes assumed to take a prohibitively large amount of space (and to consequently require disk storage), nanocubes fit in a modern laptop's main memory, even for hundreds of millions of entries. I will present live demos of the technique on a variety of real-world datasets, together with comparisons to the previous state of the art with respect to memory, timing, and network bandwidth measurements.
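The nanocube's shared tree structures are beyond the scope of an abstract; purely as a hedged illustration of the kind of aggregation query a spatiotemporal data cube answers (counts binned by region and time), here is a minimal in-memory sketch in Python. The class, column names, and bin sizes are invented for the example and are not the nanocube algorithm itself.

```python
from collections import defaultdict

# Toy in-memory "cube": counts of events keyed by (spatial bin, hour-of-day).
# This only illustrates the kind of aggregation a nanocube answers interactively;
# the real data structure shares subtrees across dimensions to stay compact.
class ToyCube:
    def __init__(self, cell_deg=1.0):
        self.cell_deg = cell_deg
        self.counts = defaultdict(int)

    def insert(self, lat, lon, hour):
        key = (int(lat // self.cell_deg), int(lon // self.cell_deg), hour)
        self.counts[key] += 1

    def count_in_region(self, lat_min, lat_max, lon_min, lon_max, hour=None):
        """Total count in a lat/lon box, optionally restricted to one hour bin."""
        total = 0
        for (i, j, h), n in self.counts.items():
            lat0, lon0 = i * self.cell_deg, j * self.cell_deg
            if lat_min <= lat0 < lat_max and lon_min <= lon0 < lon_max:
                if hour is None or h == hour:
                    total += n
        return total

cube = ToyCube()
cube.insert(40.7, -74.0, hour=9)   # hypothetical event records
cube.insert(40.8, -73.9, hour=9)
cube.insert(34.1, -118.2, hour=21)
print(cube.count_in_region(40, 41, -75, -73, hour=9))  # -> 2
```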
Authors: Carlos Scheidegger
Machine Vision Methods for the Diffuse Universe [29.48 MB PDF]
Joshua Peek
Data science in astronomy usually focuses on tabular data: rows and columns. This methodology often matches well to the science of astronomy, as we often examine a black sky containing individual, easy-to-discretize objects like stars and galaxies. However, to understand how these objects form, we must investigate the spatial and kinematic structure of the diffuse universe, which holds the vast majority of baryons. I will discuss a number of machine-vision methods for interpreting the diffuse sky, with a focus on the interstellar medium. I will show that morphological information in the diffuse universe can be recovered, and that it can be used to predict and measure underlying physical quantities.
Authors: J. E. G. Peek
Visualization and Analysis of Rich Spectral-Line Datasets [4.29 MB PDF]
Elisabeth Mills
The visualization and analysis of ever more complicated multi-dimensional data is the limiting factor in our ability to exploit the capabilities of modern radio interferometers. Traditional methods for understanding radio data cubes do not scale well to the much richer data sets of today. The challenges these data sets present will only grow worse as our instruments grow more sensitive and the data volumes become larger. Previously specialized cases, such as dealing with complex velocity structures and extreme spectral complexity, will soon become the norm, requiring new approaches. Various projects are currently addressing the need for improved visualization tools; however, most of these focus on handling larger data volumes with existing techniques. To augment these, we have begun a collaboration with the Scientific Computing and Imaging Institute at the University of Utah to develop new techniques that assist in the analysis of rich spectral-line data sets by taking advantage of previously developed techniques for visualizing high-dimensional datasets. Ultimately, the goal is to reduce the cognitive load on astronomers by maintaining and intuitively displaying the relations among spatial regions, the frequency and velocity axes, and molecular species, to facilitate a physical understanding of the astronomical region.
Authors: E.A.C. Mills, J. Kern, J. Corby, B. Kent
Leveraging Annotated Archival Data with Domain Adaptation to Improve Data Triage in Optical Astronomy [1.49 MB PDF]
Brian Bue
Recent efforts have demonstrated the potential of deploying automated “data triage” systems in the science data processing pipelines of optical astronomical surveys. Of these, the astronomical transient detection pipeline at the intermediate Palomar Transient Factory (iPTF) is a notable success. Rather than relying on human eyes to examine and analyze all collected data, iPTF relies upon a “Real-Bogus” classifier to vet candidate transient sources detected by its image subtraction pipeline. An effective Real-Bogus classifier filters out bogus candidates (e.g., image artifacts) and preserves candidates that potentially represent real astronomical transients, allowing domain experts to focus attention on observations worthy of spectroscopic follow-up while reducing the time and effort necessary to manually filter out false detections.
In general, deploying an effective data triage system demands that (1) a representative set of annotated data be available to train the classifier, and (2) new data be measured in a similar fashion to the training data. However, these conditions are often not satisfied in practice. For instance, changes to an image-processing pipeline that occur during the operational phases of an ongoing survey can substantially alter new measurements in comparison to those used to train the classifier. Another setting that faces similar challenges occurs when a new survey or instrument comes online, as annotated observations are typically in short supply in the early phases of a new campaign. In both of the aforementioned settings, the training data is often not sufficiently representative, and as a result the data triage system will perform poorly when applied to the new observations.
These issues are typically resolved by waiting for new data to be collected and labeled, and simply retraining the classifier. However, this causes productivity delays and ignores the wealth of annotated data from earlier campaigns.
When a representative training set is not available, a machine learning technique known as domain adaptation provides a more attractive solution. Domain adaptation techniques compute a mapping between data from a “source domain” and data from a related “target domain,” each captured in similar, but not identical, measurement regimes, that reconciles differences between the domains. This mapping allows a classifier trained using annotated observations collected during earlier campaigns with similar science objectives to generate robust predictions for new observations. Applying a domain adaptation technique can reduce or eliminate the productivity delays data triage systems experience while waiting for new observations to be annotated, thereby increasing the potential science return for ongoing surveys, and also for new surveys where annotated archival data is available.
We evaluate domain adaptation techniques as a solution for data triage systems lacking a representative training set. We consider an instance of “data shift” caused by a data pipeline upgrade when iPTF succeeded the earlier Palomar Transient Factory (PTF) in January 2013. An immediate consequence of the data shift was that several measurements that were discriminative for PTF imagery became less informative for iPTF imagery, resulting in a substantial decrease in prediction accuracy. We provide illustrative examples of the measurements that experienced data shift, and review the implications of applying a classifier across differing measurement regimes. We show that domain adaptation techniques substantially improve Real-Bogus prediction accuracy across the PTF and iPTF measurement regimes. Our results suggest that a similar approach can be beneficial for bootstrapping data triage systems for future surveys such as the Zwicky Transient Facility.
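The specific domain adaptation methods evaluated in this work are not detailed in the abstract; as one hedged illustration of the general idea (reconciling feature distributions between a source and a target measurement regime before training a classifier), the sketch below applies a simple correlation-alignment-style transform in NumPy. The feature matrices are synthetic placeholders, not PTF/iPTF measurements.

```python
import numpy as np
from numpy.linalg import inv
from scipy.linalg import sqrtm  # matrix square root

def align_source_to_target(Xs, Xt, eps=1e-3):
    """Correlation-alignment-style mapping: whiten the source features, then
    re-color them with the target covariance, so a classifier trained on the
    transformed source better matches the target measurement regime."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
    whiten = inv(sqrtm(Cs))
    recolor = sqrtm(Ct)
    Xs_centered = Xs - Xs.mean(axis=0)
    return np.real(Xs_centered @ whiten @ recolor) + Xt.mean(axis=0)

# Synthetic stand-ins for "PTF-era" (labeled) and "iPTF-era" (unlabeled) features.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(500, 6))             # source-domain feature vectors
Xt = 2.0 * rng.normal(size=(400, 6)) + 1.0 # target domain with shifted statistics
Xs_adapted = align_source_to_target(Xs, Xt)
# A Real-Bogus-style classifier would now be trained on (Xs_adapted, source labels)
# and applied to the target-domain features Xt.
```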
Authors: Brian D. Bue, Umaa D. Rebbapragada
Knowledge Discovery from the Hyperspectral Sky [2.32 MB PDF]
Erzsebet Merenyi
“Big Data” can connote large data volume, high feature-space dimension, and complex data structure, in any combination. This talk will focus on the complexity aspect of data, both big and small, because information extraction algorithms that work well for data of relatively simple structure often break down on “highly structured” data.
High feature-space dimension tends to increase complexity by virtue of the large number of relevant descriptors that allow discrimination among many different clusters of objects. Compared to broad-band, multispectral data (in planetary astronomy) or single-line observations (in stellar astronomy), hyperspectral data bring a jump in the complexity of spectral patterns and the cluster structure, and consequently in the analysis challenges for information discovery and extraction tasks such as clustering, classification, parameter inference, or dimensionality reduction. Many classical favorite techniques fail these challenges if one’s aim is to fully exploit the rich, intricate information captured by the sensor, ensure discovery of surprising small anomalies, and more. In stellar astronomy, where Ångström resolution is typical, the data complexity can grow even higher. With the advent of 21st century observatories such as ALMA, high spatial and spectral resolution image cubes with thousands of bands are extending into new and wider wavelength domains, adding impetus to develop increasingly powerful and efficient knowledge extraction tools.
I will present applications of brain-like machine learning, specifically advanced forms of neural maps that mimic analogous behaviors in natural neural maps in brains (e.g., preferential attention to rare signals, to enhance discovery of small clusters). I will give examples of structure discovery from hyperspectral data in planetary and radio astronomy, and point out advantages over more traditional techniques.
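The advanced neural-map variants described here (e.g., with preferential attention to rare signals) go well beyond a few lines of code; purely as a hedged point of reference, the sketch below trains a basic self-organizing map on synthetic spectra in Python. The grid size, learning schedule, and data are placeholders, and this is the textbook algorithm, not the methods presented in the talk.

```python
import numpy as np

def train_som(data, grid=(10, 10), iters=5000, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal self-organizing map: each grid node holds a prototype spectrum,
    and the best-matching node plus its neighborhood are nudged toward each
    randomly drawn input sample."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = grid
    dim = data.shape[1]
    weights = rng.normal(size=(n_rows, n_cols, dim))
    coords = np.dstack(np.meshgrid(np.arange(n_rows), np.arange(n_cols), indexing="ij"))
    for t in range(iters):
        x = data[rng.integers(len(data))]
        # best-matching unit for this sample
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # decaying learning rate and neighborhood radius
        frac = t / iters
        lr = lr0 * (1 - frac)
        sigma = sigma0 * (1 - frac) + 0.5
        grid_d2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
        h = np.exp(-grid_d2 / (2 * sigma ** 2))[..., None]
        weights += lr * h * (x - weights)
    return weights

# Synthetic "hyperspectral" samples: 300 spectra with 64 channels each.
spectra = np.random.default_rng(1).normal(size=(300, 64))
prototypes = train_som(spectra)  # 10x10 map of prototype spectra
```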
Authors: Erzsébet Merényi
Streaming Algorithms for Optimal Combination of Images and Catalogues [1.3 MB PDF]
Tamas Budavari
Modern astronomy is increasingly relying on large surveys, whose dedicated telescopes tenaciously observe the sky every night. The stream of data is often just collected to be analyzed in batches before each release of a particular project. Processing such large amounts of data is not only computationally inefficient, it also introduces a significant delay before the measurements become available for scientific use. We will discuss algorithms that can process data as soon as they become available and provide incrementally improving results over time. In particular we will focus on two problems: (1) Repeated exposures of the sky are usually convolved to the worst acceptable quality before coadding for high signal-to-noise. Instead, one can use (blind) image deconvolution to retain high-resolution information while still gaining signal-to-noise ratio. (2) Catalogs (and lightcurves) are traditionally extracted in apertures obtained from deep coadds after all the exposures are taken. Alternatively, incremental aggregation and probabilistic filtering of intermediate catalogs could provide immediate access to faint sources during the life of a survey.
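As a hedged illustration of the "incrementally improving results" idea (not the blind-deconvolution or probabilistic-filtering machinery itself), here is a minimal running inverse-variance-weighted coadd in Python that updates as each new exposure arrives; the array shapes and noise model are assumptions for the example.

```python
import numpy as np

class RunningCoadd:
    """Inverse-variance-weighted mean of exposures, updated one frame at a time,
    so an improving image is available throughout the survey instead of only
    after a batch coaddition at the end."""
    def __init__(self, shape):
        self.weighted_sum = np.zeros(shape)
        self.weight = np.zeros(shape)

    def add_exposure(self, image, variance):
        w = 1.0 / variance
        self.weighted_sum += w * image
        self.weight += w

    @property
    def coadd(self):
        with np.errstate(invalid="ignore", divide="ignore"):
            return np.where(self.weight > 0, self.weighted_sum / self.weight, np.nan)

# Feed in simulated exposures of a constant sky patch with per-pixel noise.
rng = np.random.default_rng(42)
truth = rng.uniform(0, 10, size=(64, 64))
stack = RunningCoadd(truth.shape)
for _ in range(25):
    noise_var = np.full(truth.shape, 4.0)
    stack.add_exposure(truth + rng.normal(0, 2.0, truth.shape), noise_var)
    # stack.coadd is usable (and steadily less noisy) after every exposure
```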
Authors: Tamas Budavari
Two applications of machine learning to data from galaxy surveys [4.93 MB PDF]
Viviana Acquaviva
I will present two applications of machine learning to data from galaxy surveys. In the first, we use clustering to group together galaxies with similar spectral energy distributions (SEDs). The two main applications are safely stacking SEDs with S/N too low to allow individual analysis, and saving CPU time in SED fitting. In the second, we use Support Vector Machines to improve the classification of high/low-redshift objects in the HETDEX survey, and show how this lowers contamination and improves completeness with respect to previously used techniques.
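As a hedged sketch of the second application's general approach (not the authors' actual feature set, data, or tuning), the snippet below trains a scikit-learn support vector machine to separate two redshift classes from placeholder photometric features.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholder features standing in for survey measurements
# (e.g. line fluxes, colors); labels 1 = high-z, 0 = low-z.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.normal(size=n),   # hypothetical color
    rng.normal(size=n),   # hypothetical line-flux ratio
])
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced")
clf.fit(X_train, y_train)
# Recall corresponds to completeness, and (1 - precision) to contamination,
# the two quantities emphasized in the abstract.
print(classification_report(y_test, clf.predict(X_test)))
```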
Authors: Viviana Acquaviva, Eric Gawiser, Mario Martin, Ashwin Satyanarayana
Toyz for Data Analysis [1.15 MB PDF]
Fred Moolekamp
Toyz is a nearly completed open-source Python package designed for big data analysis, particularly for large data sets stored on an external server or supercomputer. The package runs a Python Tornado web server and allows a user to view and interact with the data via a web browser. Most of the data processing is done on the server, running scripts written in Python, C/C++, Fortran, R, or any other language that has a Python wrapper. The user is then able to use tools written in HTML5/JavaScript to interact with the data, including multiple plots that display high-dimensional data, a FITS viewer for large images (including DECam stacked images), astronomical data analysis via the Astropy and Astro Toyz packages, and customizable pipeline interfaces that walk other researchers/students through a defined data reduction process. In addition to the default tools included in the package, custom scripts and packages built on the Toyz framework can be created to perform additional analysis. For example, additional image processing (such as PSF photometry) can also be implemented if software such as IRAF/PyRAF or SExtractor is installed on the server.
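Toyz's own APIs are not reproduced here; as a hedged sketch of the server-side pattern the abstract describes (a Tornado web server running Python analysis code and returning results to a browser client), here is a minimal standalone example. The endpoint name and the statistic computed are invented for the illustration.

```python
import json
import numpy as np
import tornado.ioloop
import tornado.web

class StatsHandler(tornado.web.RequestHandler):
    """Hypothetical endpoint: the browser posts a JSON array of values and gets
    back summary statistics computed server-side, mimicking the pattern of
    heavy processing on the server and lightweight display in the client."""
    def post(self):
        values = np.asarray(json.loads(self.request.body)["values"], dtype=float)
        self.write({"mean": float(values.mean()),
                    "std": float(values.std()),
                    "n": int(values.size)})

def make_app():
    return tornado.web.Application([(r"/api/stats", StatsHandler)])

if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()
```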
Authors: Fred Moolekamp and Eric Mamajek
Hunting the Rarest of the Rare: From PS1 to LSST [3.45 MB PDF]
Gautham Narayan
The CfA/JHU transient science client operated on Pan-STARRS 1 (PS1) Medium Deep Survey (MDS) images from 2010 to 2014, and discovered over 5000 supernovae, hundreds of which were followed up spectroscopically. I will discuss our experience adapting the ESSENCE/SuperMACHO pipeline (which operated on 15 TB of images over 6 years) to work on PS1-MDS (800+ TB of images over 4 years), with a particular emphasis on difference imaging, artifact and variable rejection, and catalog cross-matching.
The data volume from LSST will be several times that from PS1, but these problems will remain challenging. The previous generation of variable and transient surveys were designed to detect type Ia supernovae for cosmological analysis. LSST is designed to serve a much broader community of astronomers, with varied interests, and many of the variables and transients it will find have never been seen before. I’ll discuss our work on the ANTARES project, and how we are using our experience with supernova searches to tackle the more general problem of characterizing the entire transient and variable sky. Our prototype is focused on identifying the “rarest of the rare” events in real time to coordinate detailed follow-up studies, but we must accurately characterize known objects with sparse data to separate the wheat from the chaff. I’ll detail some of the new algorithms being developed for the project, describe the more complex architecture we need to accomplish this more ambitious goal, and present some of our preliminary results using existing data sets.
Authors: Gautham Narayan
MyMergerTree: A Cloud Service for Creating and Analyzing Galactic Merger Trees [1007 KB PDF]
Sarah Loebman
I will present the motivation, design, implementation, and preliminary evaluation for a service that enables astronomers to study the growth history of galaxies by following their ‘merger trees’ in large-scale astrophysical simulations. The service uses the Myria parallel data management system as back-end and the D3 data visualization library within its graphical front-end. I will demonstrate this MyMergerTree service at the conference on a ∼5 TB dataset and discuss the future of such analyses at scale.
Authors: S. Loebman
Enhancing the Legacy of HST Spectroscopy in the Era of Big Data [6.07 MB PDF]
Alessandra Aloisi
In the era of large astronomical data, data-driven multi-wavelength science will play an increasing role in Astronomy over the next decade as surveys like WFIRST/AFTA and LSST become realities. The Mikulski Archive for Space Telescopes (MAST), located at the Space Telescope Science Institute, is a NASA-funded project to support and provide to the astronomical community a variety of astronomical data archives, with a primary focus on scientifically related datasets in the optical, ultraviolet, and near-infrared parts of the spectrum. Within MAST, much attention has been devoted in the past to increasing discoverability, improving access, and creating high-level science products for imaging data, while spectroscopic science products are presently very limited and primarily related to grism spectra. STScI has recently embarked on an effort to remedy this situation and to implement a number of possible enhancements to the Hubble Space Telescope archive that would make spectroscopic data more useful to the scientific community. Details of this effort will be given, including the development of algorithms for combining spectra, the definition of new high-level science products, the consolidation of existing visualization tools for spectra into the MAST portal, the implementation of discovery tools for spectroscopy, and the creation of tools for spectral feature identification and measurement. These enhancements will help the science community tap the latent science potential of Hubble's archival spectroscopic data for many years to come.
Authors: Alessandra Aloisi
Wide-Field Radio Astronomy and the Dynamic Universe [4.74 MB PDF]
Bryan Gaensler
At radio wavelengths, tremendous signal-processing overheads have left the time-varying sky largely unexplored. However, a suite of new wide-field telescopes are now beginning to take data over a wide range of observing frequencies, time resolutions and cadences. These efforts will soon result in a comprehensive, deep census of the variable and transient radio sky. In this talk, I will review the analysis techniques, algorithms and machine learning approaches that are allowing us to extract a new view of the time-variable sky from large and complex data sets, and will conclude by highlighting the unexpected behavior and newly identified source classes that these surveys have so far unveiled.
Authors: Bryan Gaensler
Probabilistic photometric redshift in the era of petascale astronomy [4.49 MB PDF]
Matias Carrasco Kind
Photometric redshifts (photo-z) are quickly becoming a critical measurement for large photometric galaxy surveys, and there has been significant development in this area in the last decade. Given the enormous amount of imaging data we expect to be available in the upcoming years, there is a need for fast, robust and more complex algorithms to compute and store not only a single photo-z estimate but also its probability density function.
In this talk I will review some of the state-of-the-art machine learning and data mining algorithms used to compute, combine and store photo-z PDFs for large galaxy surveys. I will discuss how a supervised machine learning technique, an unsupervised technique and a standard template-fitting approach can be combined in a Bayesian framework to improve the accuracy of the photo-z PDF while reducing the fraction of outliers, in a completely new approach. I will also discuss how we can reduce the storage required for individual PDFs by using sparse representation techniques, achieving a compression rate of 90% with a reconstruction accuracy of 99.8% using only 10 to 20 4-byte integers per galaxy. This allows us to store the results of several techniques in a reduced amount of space, which is becoming critical for current photometric surveys where several hundred million objects are expected. I will finish by showing recent progress on how to implement and access these photo-z PDFs from a SQL database using advanced techniques, allowing us for the first time to work with these objects directly in the database in a fast and efficient manner.
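The actual basis and integer packing used in this work are not described in the abstract; purely as a hedged illustration of the general sparse-representation idea (keep only the few largest coefficients of each PDF in some basis and store their indices and quantized amplitudes), here is a toy Python sketch using a discrete cosine basis. The grid, basis, and bit widths are assumptions for the example.

```python
import numpy as np
from scipy.fft import dct, idct

def compress_pdf(pdf, n_keep=15):
    """Keep the n_keep largest DCT coefficients of a gridded photo-z PDF and
    quantize each to a 16-bit index/amplitude pair, a toy stand-in for storing
    a handful of small integers per galaxy instead of the full gridded PDF."""
    coeffs = dct(pdf, norm="ortho")
    idx = np.argsort(np.abs(coeffs))[-n_keep:]
    scale = np.abs(coeffs[idx]).max()
    quantized = np.round(coeffs[idx] / scale * 32767).astype(np.int16)
    return idx.astype(np.uint16), quantized, scale

def decompress_pdf(idx, quantized, scale, n_bins):
    coeffs = np.zeros(n_bins)
    coeffs[idx] = quantized.astype(float) / 32767 * scale
    return np.clip(idct(coeffs, norm="ortho"), 0, None)

# Toy PDF: two Gaussian peaks on a redshift grid.
z = np.linspace(0, 2, 200)
pdf = np.exp(-0.5 * ((z - 0.6) / 0.05) ** 2) + 0.4 * np.exp(-0.5 * ((z - 1.3) / 0.08) ** 2)
pdf /= pdf.sum() * (z[1] - z[0])          # normalize on the grid
idx, q, s = compress_pdf(pdf)
recovered = decompress_pdf(idx, q, s, len(z))  # close to pdf, from ~15 stored numbers
```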
Authors: Matias Carrasco Kind + DESDM NCSA team
Case study: Classifying High Redshift Quasars on the LSST-Reprocessed SDSS Stripe 82 Imaging [6.13 MB PDF]
Yusra AlSayyad
To evaluate methods for extracting information from multi-epoch imaging, the LSST data management team is testing the prototype pipeline, which converts pixels into catalogs, on simulated images and existing time-domain surveys. A 250 sq. deg. repeatedly-imaged stripe of the SDSS called Stripe 82 provides a strong test dataset because it was obtained under skies with highly variable conditions. Reprocessing the 20 TB of imaging consisted of detecting sources on i-band co-adds but measuring flux on the single-epoch (ugriz) images, generating complete and less biased lightcurves for sources fainter than the single-epoch detection threshold. I will describe the technical challenges of aggregating these 16 billion photometry measurements into lightcurve metrics for each of the 40 million objects. I will also discuss our application of ensemble classifiers to this dataset to select z~4 quasars, taking care to maintain the ability to estimate the completeness of the sample when the underlying population has a complex distribution in feature space.
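As a hedged, small-scale sketch of the aggregation step (turning per-epoch forced photometry into per-object lightcurve metrics), here is a pandas groupby example; at the real scale this work was distributed across many nodes, and the column names and metrics below are placeholders rather than the pipeline's actual schema.

```python
import numpy as np
import pandas as pd

# Placeholder forced-photometry table: one row per (object, epoch, band) measurement.
rng = np.random.default_rng(0)
n = 10_000
phot = pd.DataFrame({
    "object_id": rng.integers(0, 500, n),
    "band": rng.choice(list("ugriz"), n),
    "mjd": rng.uniform(53000, 54000, n),
    "flux": rng.normal(100, 10, n),
    "flux_err": rng.uniform(5, 15, n),
})

# Per-object, per-band lightcurve metrics of the kind used as classifier features.
metrics = (phot.groupby(["object_id", "band"])
                .agg(n_epochs=("flux", "size"),
                     mean_flux=("flux", "mean"),
                     flux_std=("flux", "std"),
                     mean_err=("flux_err", "mean"))
                .reset_index())
# Simple variability indicator: scatter in excess of the measurement errors.
metrics["excess_var"] = metrics["flux_std"] ** 2 - metrics["mean_err"] ** 2
```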
Authors: Yusra AlSayyad, Ian McGreer, Andy Connolly, Xiaohui Fan, Zeljko Ivezic, Andy Becker, Mario Juric, and the LSST Data Management team
Photometric classification of QSOs from RCS2 using Random Forest [2.92 MB PDF]
Felipe Barrientos
We describe the construction of a quasar catalog containing 91,842 candidates derived from analysis of imaging data with a Random Forest algorithm. Using spectroscopically confirmed stars and quasars from the SDSS as a training set, we blindly search the RCS-2 (750 deg^2) imaging survey. From a source catalogue of 1,863,970 RCS-2 point sources, our algorithm identifies putative quasars from broadband magnitudes (g, r, i, and z) and colours. By adding WISE and/or GALEX photometry we improve our candidate lists, obtaining a precision and recall of up to 99.3% and 99.2%, respectively. Spectroscopy for a small sample confirms our expectations.
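The abstract does not list the exact features or hyperparameters; as a hedged sketch of the stated approach (a Random Forest classifying point sources from broadband magnitudes and colours), here is a scikit-learn example with synthetic placeholder photometry standing in for the SDSS training set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a labeled training set of stars (0) and quasars (1).
rng = np.random.default_rng(0)
n = 5000
mags = pd.DataFrame({b: rng.normal(20, 1.5, n) for b in ["g", "r", "i", "z"]})
labels = rng.integers(0, 2, n)
mags.loc[labels == 1, "g"] -= 0.4          # give the toy quasars a colour offset

# Features: magnitudes plus adjacent-band colours, as described in the abstract.
features = mags.copy()
features["g_r"] = mags["g"] - mags["r"]
features["r_i"] = mags["r"] - mags["i"]
features["i_z"] = mags["i"] - mags["z"]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, features, labels, cv=5, scoring="precision")
print(scores.mean())   # the precision/recall quoted in the abstract are for real data
```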
Authors: F. Barrientos, D. Carrasco, T. Anguita, K. Pichara, et al
Challenges in producing and using large spectroscopic and photometric redshift datasets [3.22 MB PDF]
Benjamin Weiner
Next-generation surveys aim to measure extremely large samples of both spectroscopic and photometric redshifts. These will be used both to make precision cosmological tests and to study the properties and evolution of galaxies. Different use cases, for example measuring clustering vs. measuring a luminosity function, have substantially different requirements on the completeness and reliability of the redshifts, and on the supplementary information required to use them. The goals of the survey team and community users will often diverge, and this should be considered when specifying deliverable data products. I will discuss the problem of classifying spectroscopic redshifts in deep, low-S/N surveys. For example, in the DEEP2 survey we had ~50,000 spectra yielding ~30,000 good redshifts, and were able to fit the redshifts automatically, but had to classify quality by eye. This contrasts with higher-S/N surveys like SDSS, and clearly does not scale to the next generation of very large surveys. Machine learning techniques may help with this problem. I will also discuss the calibration and use of photometric redshifts and the statistical techniques astronomers will need for measurements using photo-z distributions. In particular, I suggest that we need to move beyond simple treatments of photo-z errors. Furthermore, inference from photometric redshift catalogs will require more than simply a likelihood function P(z) for each galaxy, and this will be critical for LSST.
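As a hedged example of moving beyond a single point estimate per galaxy, the sketch below builds a redshift distribution by stacking per-galaxy P(z) functions on a common grid. It illustrates the kind of P(z)-based inference referred to above, not any specific survey's method, and, as the abstract argues, a full treatment ultimately needs more than per-galaxy likelihoods.

```python
import numpy as np

z_grid = np.linspace(0, 3, 301)

def gaussian_pz(z_peak, sigma):
    """Toy per-galaxy photo-z PDF on the common grid, normalized to unit area."""
    p = np.exp(-0.5 * ((z_grid - z_peak) / sigma) ** 2)
    return p / (p.sum() * (z_grid[1] - z_grid[0]))

# Hypothetical catalog: each galaxy carries a full P(z), not just a point estimate.
rng = np.random.default_rng(0)
catalog_pz = np.array([gaussian_pz(zp, 0.05 * (1 + zp))
                       for zp in rng.uniform(0.2, 1.5, 1000)])

# Estimated N(z) of the sample: the sum (or a weighted sum) of the individual PDFs.
n_of_z = catalog_pz.sum(axis=0)
mean_z = np.sum(z_grid * n_of_z) / n_of_z.sum()
```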
Authors: Benjamin Weiner
Characterizing the variable sky with CRTS [20.12 MB PDF]
Matthew Graham
The Catalina Real-time Transient Survey covers 80% of the sky to a magnitude limit of V~20 and has light curves for 500 million objects with typically 250 observations over a baseline of almost ten years. It is an unprecedented data set for understanding the temporal behaviour of many classes of astrophysical phenomena. It also serves as an excellent testbed on which to develop the tools and techniques necessary to work with future data intensive projects such as LSST and SKA. In this talk, I will review what CRTS is telling us about the variable sky and particularly how it is providing new insight into the physics of quasars.
Authors: Matthew Graham, S. G. Djorgovski, A. Drake, A. Mahabal, C. Donalek
Enabling Scalable Data Analytics for LSST and Beyond through Qserv [29.63 MB PDF]
Jacek Becla
The construction of the Large Synoptic Survey Telescope (LSST) is starting this year. Its large and rich data set will enable exploration of a wide range of astrophysical questions, ranging from discovering “killer” asteroids to examining the nature of dark energy. The exploration will involve running a wide range of spatial and temporal analyses, including computationally intensive scientific investigations in multi-dimensional space over the entire data set. Over a million analyses, from trivial to very complex, are expected daily, every day, for decades. A specialized database system is being developed to make that not only possible, but also fast and efficient. The talk will describe the driving requirements, assumptions made, and design trade-offs that were considered in developing the baseline architecture for the LSST database. Details of its prototype implementation, Qserv, will be given, highlighting its features, status, and potential for usage outside LSST and even outside astronomy.
Authors: Jacek Becla
Will Your Hypercubes Eat My Asterozoa? Building a Versatile Data Archive for the SAMI IFU Galaxy Survey. [6.54 MB PDF]
Iraklis Konstantopoulos
It turns out that data are just data. So, with only a little effort, your 5D hypercubes will be safe in the same tank as those version-controlled tables. This is the assurance I gave the SAMI team when delivering the original specification for the data archive, albeit slightly paraphrased. The SAMI Galaxy Survey will ultimately collect spatially resolved spectroscopy of ≈3400 galaxies. While managing the volume of data is not in itself a tall order, the nature of the information stored presents an opportunity for astronomers to set up creative information access methods. With the primary product being a datacube, that is, a matrix of spectra, one should be free to slice and dice at will, and do so directly on the data archive, rather than being restricted by having to download thousands of individual data files. This is what we have set up for SAMI: a data archive, query engine, and database, all rolled into one, using the HDF5 file format through a set of Python codes. I will be presenting an overview of this open source software, which we hope will be of use to other scientists who seek a simple and versatile data access mechanism. I will also review the use case for the Starfish diagram, which was developed to succinctly and simultaneously convey all sorts of information about an individual galaxy (database entry) and the sample in its entirety.
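As a hedged sketch of the access pattern described (slicing a stored datacube directly in the archive instead of downloading whole files), here is an h5py example; the file name, group layout, and dataset names are hypothetical and not the SAMI archive's actual schema.

```python
import numpy as np
import h5py

# Write a toy archive: one group per galaxy, each holding a (wavelength, y, x) cube.
with h5py.File("toy_sami_archive.h5", "w") as archive:
    cube = archive.create_group("galaxy_0001").create_dataset(
        "blue_cube", data=np.random.rand(2048, 50, 50), compression="gzip")
    cube.attrs["crval3"] = 3700.0   # hypothetical wavelength zero point (Angstrom)
    cube.attrs["cdelt3"] = 1.0

# Query-style access: pull a single spaxel's spectrum and a narrow-band slice
# without reading the rest of the cube into memory.
with h5py.File("toy_sami_archive.h5", "r") as archive:
    cube = archive["galaxy_0001/blue_cube"]
    spectrum = cube[:, 25, 25]                 # spectrum at the central spaxel
    narrow_band = cube[1000:1010].sum(axis=0)  # collapsed 10-channel image
```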
Authors: Iraklis Konstantopoulos
The StArchive: An Open Stellar Archive [16.19 MB PDF]
Angelle Tanner
Historically, astronomers have utilized a piecemeal set of archives such as SIMBAD, the Washington Double Star Catalog, various exoplanet encyclopedias and electronic tables from the literature to cobble together stellar and planetary parameters, with the absence of corresponding images and spectra. The mothballed NStED archive was in the process of collecting such data on nearby stars, but if it comes back its focus may shift to NASA mission-specific targets and NOT a volume-limited sample of nearby stars. This means there is a void: a void in the available set of tools many exoplanet astronomers would appreciate for creating comprehensive lists of the stellar parameters of stars in our local neighborhood. We also need better resources for downloading adaptive optics images and published spectra to help confirm new discoveries and find ideal target stars. With so much data being produced by the stellar and exoplanet community, we have decided to propose the creation of the Starchive, an open-access stellar archive in the spirit of the open exoplanet catalog, the Kepler Community Follow-up Program and many other existing archives. While we will highly regulate and constantly validate the data being placed into our archive, the open nature of its design is intended to allow the database to be updated quickly and to have the level of versatility that is necessary in today's fast-moving, big-data exoplanet community. Here, I will introduce the community to the content and expected capabilities of the archive and query the audience for community feedback.
Authors: Tanner, A. et al.
Reducing Large Imaging Data Sets on the Amazon Elastic Compute Cloud [16.68 MB PDF]
Ben Williams
The PHAT survey's 6-band UV-IR photometry of over 100 million stars was performed on the Amazon Elastic Compute Cloud (EC2). Using this resource for parallel reduction of this large data set required overcoming several technical obstacles regarding security, file sharing, network communication, error trapping, and more. I will discuss how these issues were overcome in order to produce an efficient automated pipeline on EC2, as well as the advantages of using this resource compared to a dedicated compute cluster.
Authors: Ben Williams
Creating A Multiwavelength Galactic Plane Atlas With Amazon Web Services [17.3 MB PDF]
Bruce Berriman
We describe by example how astronomers can optimize cloud-computing resources offered by Amazon Web Services (AWS) to create, curate and serve new datasets at scale. We have produced an atlas of the Galactic Plane at 16 wavelengths from 1 micron to 24 microns with a spatial sampling of 1 arcsec. The atlas has been created using the Montage mosaic engine to generate co-registered mosaics of images released by the major surveys WISE, 2MASS, MSX, GLIMPSE and MIPSGAL. The full atlas is 59 TB in size, and composed of over 9,600 5 degree x 5 degree tiles with one degree overlap between them. The dataset is housed on the Amazon S3 storage platform, designed for at-scale storage with access via web protocols. It is accessible through a prototype web-form and API that will support access to the data according to the users’ query specifications. When the interface is complete, the data set will be made public.
The processing required 340,000 compute hours for completion, carried out on virtual clusters created and managed on AWS platforms through the Pegasus workflow management system. We will describe the optimization methods, compute time and processing costs, as a guide for others wishing to take advantage of cloud platforms for processing and data creation. Our goal is to use the expertise developed here to develop a set of “turnkey” tools, based on Open Source technologies, that will enable scientists to perform processing at scale on cloud platforms with minimal system management knowledge. This is part of a major upgrade in the capabilities of Montage, which will include support for image cubes and data organized in the HEALPix partitioning scheme.
Authors: G. B. Berriman, J. C. Good, M. Rynge, E. Deelman, G. Juve, J. Kinney, A. Merrihew
Trident: Custom Scalable Scientific Compute Archive and Analysis [6.56 MB PDF]
Arvind Gopu
As imaging systems improve, the size of astronomical data has continued to grow, making the transfer, secure storage, processing, and analysis of data a significant burden. To solve this problem for the WIYN Observatory One Degree Imager (ODI), we developed the ODI–Pipeline, Portal, and Archive (ODI-PPA), a completely web-based solution that provides astronomers a modern user interface, acts as a single point of access to their data, and offers rich computational and visualization capabilities. It supports scientists in handling complex data sets, while enhancing WIYN's scientific productivity beyond data acquisition.
ODI-PPA is designed to be a Scalable Compute Archive (SCA) with several built-in frameworks: (1) Collections, which allow an astronomer to create logical collations of data products intended for publication, further research, instructional purposes, or to execute data processing tasks; (2) Image Explorer and Source Explorer, which together enable real-time interactive visual analysis of massive astronomical data products within an HTML5-capable web browser, with overlaid standard-catalog and Source Extractor-generated source markers; and (3) a Workflow framework, which enables rapid integration of data processing pipelines on an associated compute cluster and lets users request such pipelines to be executed on their data via custom user interfaces.
Once ODI-PPA became operational, we identified other research groups and centers that may benefit from having an SCA system for their users and data. We prototyped a PPA-like system for the Ludwig Maximilian University's Wide Field Imager, and then adapted our code toward developing a powerful analysis/visualization portal for Globus Cluster System (GCS) data collected by IU researchers over more than a decade. Thus far, we have also adapted our code to the IU Electron Microscopy Center's data (EMC-PPA).
Our umbrella project, codenamed Trident, encompasses ODI-PPA and its offshoots EMC-PPA, LMU-PPA, and the GCS portal. It is made up of several lightweight services connected by a message bus; the web portal is built using the Twitter/Bootstrap, AngularJS and jQuery JavaScript libraries, and the backend services are written in PHP (using the Zend framework) and Python; it leverages supercomputing and storage resources at Indiana University. With data volumes and computational needs increasing steadily, and perhaps exponentially in some cases like LSST, we believe that Trident-based SCA systems can be reconfigured for use in any science domain with large and complex datasets and a need for cloud- or grid-based computational and storage resources.
Authors: Gopu, Hayashi, Young, Henschel, Kotulla, et al.
Handling the Dark Energy Survey [7.59 MB PDF]
Ignacio Sevilla
The Dark Energy Survey is well into the processing of its second year of survey data. In the meantime, scientists have been working through the Science Verification dataset and Year 1 to start producing its first scientific results. In this contribution, we summarize the challenges and current solutions being used by the collaboration, with emphasis on tools for data analysis and visualization, as well as how groups have effectively coordinated in dealing with the large amount of data coming out of the project.
Authors: Ignacio Sevilla for the Dark Energy Survey Collaboration
Efficient data reduction and analysis of DECam images in a multicore server [5.66 MB PDF]
Roberto Munoz
We have embarked on a panchromatic survey of the Fornax galaxy cluster using the DECam instrument at the Blanco telescope. The survey consists of ugriz-band photometry of nine contiguous tiles of the central Fornax cluster region, covering a total area of 27 deg^2. The first challenge we faced was not having enough disk space to store the raw and processed images, along with slow access to the data when using single drives. We solved both by using a commercial RAID system consisting of 18 HDD drives that offers a storage capacity of 38 TB and delivers a transfer rate of 1,200 MB/s. The second challenge was developing efficient algorithms for registering and stacking hundreds of images on a single multicore server. We solved this by using a combination of the Astromatic software packages and a set of IDL and Python routines. Last but not least, I will discuss several advanced methods for modeling and subtracting the sky in DECam surveys involving large galaxies in high-density environments, and will present a catalog of new dwarf galaxies in Fornax that were discovered using a machine learning algorithm.
Authors: Roberto P. Munoz, T. H. Puzia, M. Taylor, P. Eigenthaler
Wide Data vs. Big Data [28.23 MB PDF]
Alyssa Goodman
In life, more of the same isn't always better. Often, variety is more important. In Science, using more kinds of data, rather than just more data, often helps answer the hardest questions. In this talk, I will discuss tools for visualizing several data sets at once. In some cases, the visualizations are straightforward (such as overlaying layers imagery or catalog data on images, in tools like WorldWide Telescope or Aladin), but in other cases, more subtle "linked-view" visualization amongst data sets and visualization types yields the deepest insight (e.g. using tools like Glue). The talk will include demonstrations of software using real data sets, both large and small, and will illustrate why data diversity is at least as important a concept as data volume.
Authors: Alyssa Goodman
How do you look at a billion data points? Exploratory Visualization for Big Data [1.16 MB PDF]
Carlos Scheidegger
Consider exploration of large multidimensional spatiotemporal datasets with billions of entries. Are certain attributes correlated spatially or temporally? How do we even look at data of this size? In this talk, I will present the techniques and algorithms to compute and query a nanocube, a data structure that enables interactive visualizations of data sources in the range of billions of elements.
Data cubes are widely used for exploratory data analysis. Although they are sometimes assumed to take a prohibitively large amount of space (and to consequently require disk storage), nanocubes fit in a modern laptop's main memory, even for hundreds of millions of entries. I will present live demos of the technique on a variety of real-world datasets, together with comparisons to the previous state of the art with respect to memory, timing, and network bandwidth measurements.
Authors: Carlos Scheidegger
Machine Vision Methods for the Diffuse Universe [29.48 MB PDF]
Joshua Peek
Data science in astronomy usually focuses on tabular data: rows and columns. This methodology often matches well to the science of astronomy, as we often examine a black sky containing individual, easy to discretize objects like stars and galaxies. However, to understand how these objects form we must investigate the spatial and kinematic structure of the diffuse universe, which holds the vast majority of baryons. I will discuss a number of machine-vision methods for interpreting the diffuse sky, with a focus on the interstellar medium. I will show that morphological information in the diffuse universe can be recovered, and that it can be used to predict and measure underlying physical quantities.
Authors: J. E. G. Peek
Visualization an Analysis of Rich Spectral-Line Datasets [4.29 MB PDF]
Elisabeth Mills
The visualization and analysis of ever more complicated multi-dimensional data is the limiting factor in our ability to exploit the capabilities of modern radio interferometers. Traditional methods for the understanding of radio data cubes do not scale well to the much richer data sets of today. The challenges these data sets present will only grow worse as our instruments grow more sensitive and the data volumes become larger. Previously specialized cases, such as dealing with complex velocity structures and extreme spectral complexity will soon become the norm, requiring new approaches. There are various projects currently addressing the need for improved visualization tools, however, most of these focus on trying to deal with larger data volumes with existing techniques. To augment these, we have begun a collaboration with the Scientific Computing and Imaging Institute at the University of Utah to develop new techniques to assist in the analysis of rich spectral line data sets that take advantage of previously developed techniques for visualizing high-dimensional datasets. Ultimately, the goal is to reduce the cognitive load on astronomers by maintaining and intuitively displaying the relations among the spatial regions, frequency and velocity axis and molecular species, to facilitate a physical understanding of the astronomical region.
Authors: E.A.C. Mills, J. Kern, J. Corby, B. Kent
Leveraging Annotated Archival Data with Domain Adaptation to Improve Data Triage in Optical Astronomy [1.49 MB PDF]
Brian Bue
Recent efforts have demonstrated the potential of deploying automated “data triage” systems in the science data processing pipelines of optical astronomical surveys. Of these, the astronomical transient detection pipeline at intermediate Palomar Transient Factory (iPTF) is a notable success. Rather than relying on human eyes to examine and analyze all collected data, iPTF relies upon a “Real-Bogus” classifier to vet candidate transient sources detected by its image subtraction pipeline. An effective Real-Bogus classifier filters out bogus candidates (e.g., image artifacts), and preserves candidates that potentially represent real astronomical transients, and allows domain experts to focus attention on observations worthy of spectroscopic follow-up, while reducing the time and effort necessary to manually filter out false detections.
In general, deploying an effective data triage system demands that (1) a representative set of annotated data be available to train the classifier, and (2) new data be measured in a similar fashion as the training data. However, these conditions are often not satisfied in practice. For instance, changes to an image-processing pipeline that occur during the operational phases of ongoing survey can substantially alter new measurements in comparison to those used to train the classifier. Another setting that faces similar challenges occurs when a new survey or instrument comes online, as annotated observations are typically in short supply in the early phases of a new campaign. In both of the aforementioned settings, the training data is often not sufficiently representative, and as a result, the data triage system will perform poorly when applied to the new observations.
These issues are typically resolved by waiting for new data to be collected and labeled, and simply retraining the classifier. However, this causes productivity delays and ignores the wealth of annotated data from earlier campaigns.
When a representative training set is not available, a machine learning technique known as domain adaptation provides a more attractive solution. Domain adaptation techniques compute a mapping between data from a “source domain” and data from a related “target domain,” each captured in similar -- but not identical – measurement regimes, that reconciles differences between the domains. This mapping allows a classifier trained using annotated observations collected during earlier campaigns with similar science objectives to generate robust predictions for new observations. Applying a domain adaptation technique can reduce or eliminate the productivity delays data triage systems experience while waiting for new observations to be annotated, thereby increasing the potential for science return for ongoing surveys, and also for new surveys where annotated archival data is available.
We evaluate domain adaptation techniques as a solution for data triage systems lacking a representative training set. We consider an instance of “data shift” caused by a data pipeline upgrade when iPTF succeeded the earlier Palomar Transient Factory (PTF) in January 2013. An immediate consequence of the data shift was that several measurements that were discriminative for PTF imagery became less informative for iPTF imagery, resulting in a substantial decrease in prediction accuracy. We provide illustrative examples of the measurements that experienced data shift, and review the implications of applying a classifier across differing measurement regimes. We show that domain adaptation techniques substantially improve Real-Bogus prediction accuracy across the PTF and iPTF measurement regimes. Our results suggest that a similar approach can be beneficial to bootstrap data triage systems for future surveys such as the Zwicky Transient Factory.
Authors: Brian D. Bue, Umaa D. Rebbapragada
Knowledge Discovery from the Hyperspectral Sky [2.32 MB PDF]
Erzsebet Merenyi
“Big Data” can connote large data volume, high feature-space dimension, and complex data structure, in any combination. This talk will focus on the complexity aspect of data – both big and small - because information extraction algorithms that work well for data of relatively simple structure often break down on “highly structured” data.
High feature-space dimension tends to increase complexity by virtue of the large number of relevant descriptors that allow discrimination among many different clusters of objects. Compared to broad-band, multispectral data (in planetary astronomy) or single-line observations (in stellar astronomy), hyperspectral data bring a jump in the complexity of spectral patterns and the cluster structure, and consequently in the analysis challenges for information discovery and extraction tasks such as clustering, classification, parameter inference, or dimensionality reduction. Many classical favorite techniques fail these challenges if one’s aim is to fully exploit the rich, intricate information captured by the sensor, ensure discovery of surprising small anomalies, and more. In stellar astronomy, where Ångström resolution is typical, the data complexity can grow even higher. With the advent of 21st century observatories such as ALMA, high spatial and spectral resolution image cubes with thousands of bands are extending into new and wider wavelength domains, adding impetus to develop increasingly powerful and efficient knowledge extraction tools.
I will present applications of brain-like machine learning, specifically advanced forms of neural maps that mimic analogous behaviors in natural neural maps in brains (e.g., preferential attention to rare signals, to enhance discovery of small clusters). I will give examples of structure discoveriy from hyperspectral data in planetary and radio astronomy, and point out advantages over more traditional techniques.
Authors: Erzsébet Merényi
Streaming Algorithms for Optimal Combination of Images and Catalogues [1.3 MB PDF]
Tamas Budavari
Modern astronomy is increasingly relying on large surveys, whose dedicated telescopes tenaciously observe the sky every night. The stream of data is often just collected to be analyzed in batches before each release of a particular project. Processing such large amounts of data is not only inefficient computationally it also introduces a significant delay before the measurements become available for scientific use. We will discuss algorithms that can process data as soon as they become available and provide incrementally improving results over time. In particular we will focus on two problems: (1) Repeated exposures of the sky are usually convolved to the worst acceptable quality before coadding for high signal-to-noise. Instead one can use (blind) image deconvolution to retain high resolution information, while still gaining signal-to-noise ratio. (2) Catalogs (and lightcurves) are traditionally extracted in apertures obtained from deep coadds after all the exposures are taken. Alternatively incremental aggregation and probabilistic filtering of intermediate catalogs could provide immediate access to faint sources during the life of a survey.
Authors: Tamas Budavari
Two applications of machine learning to data from galaxy surveys [4.93 MB PDF]
Viviana Acquaviva
I will present two applications of machine learning to data from galaxy surveys. In the first one, we use clustering to group together galaxies with similar spectral energy distribution. The two main applications are safely stacking SEDs with S/N too low to allow individual analysis, and save CPU time in SED fitting. In the second one, we use Support Vector Machines to improve the classification of high/low redshift objects in the HETDEX survey, and show how it lowers contamination and improves completeness with respect to previously used techniques.
Authors: Viviana Acquaviva, Eric Gawiser, Mario Martin, Ashwin Satyanarayana
Toyz for Data Analysis [1.15 MB PDF]
Fred Moolekamp
Toyz is a nearly completed open source python package designed for big data analysis, particularly for large data sets stored on an external server or supercomputer. The package runs a python Tornado web server and allows a user to view and interact with the data via a web browser. Most of the data processing is done on the server, running scripts written in python, C/C++, Fortran, R, or any other language that has a python wrapper. The user is then able to use tools written in html5/javascript to interact with the data including viewing multiple plots to display high dimensional data, a FITS viewer for large images (including DECam stacked images), astronomical data analysis via the Astropy and Astro Toyz packages, and customizable pipeline interfaces that walk other researchers/students through a set data reduction process. In addition to the default tools included in the package, custom scripts and packages built on the Toyz framework can be created to perform additional analysis. For example, additional image processing (such as psf photometry) can also be implemented if additional software such as IRAF/PyRAF or SExtractor is installed on the server.
Authors: Fred Moolekamp and Eric Mamajek
Hunting the Rarest of the Rare: From PS1 to LSST [3.45 MB PDF]
Gautham Narayan
The CfA/JHU transient science client operated on Pan-STARRS 1 (PS1) Medium Deep Survey (MDS) images from 2010-4, and discovered over 5000 supernovae, hundreds of which were followed up spectroscopically. I will discuss our experience adapting the ESSENCE/SuperMACHO pipeline (operating on 15 TB of images, over 6 years) to work on PS1-MDS (800+ TB of images over 4 years), with a particular emphasis on difference imaging, artifact and variable rejection, and catalog cross-matching.
The data volume from LSST will be several times that from PS1, but these problems will remain challenging. The previous generation of variable and transient surveys were designed to detect type Ia supernovae for cosmological analysis. LSST is designed to serve a much broader community of astronomers, with varied interests, and many of the variables and transients it will find have never been seen before. I’ll discuss our work on the ANTARES project, and how we are using our experience with supernova searches to tackle the more general problem of characterizing the entire transient and variable sky. Our prototype is focused on identifying the “rarest of the the rare” events in real-time to coordinate detailed follow-up studies, but we must accurately characterize known objects with sparse data to separate the wheat from the chaff. I’ll detail some of the new algorithms being developed for the project, the more complex architecture we need to accomplish this more ambitious goal, and present some of our preliminary results using existing data sets.
Authors: Gautham Narayan
MyMergerTree: A Cloud Service for Creating and Analyzing Galactic Merger Trees [1007 KB PDF]
Sarah Loebman
I will present the motivation, design, implementation, and preliminary evaluation for a service that enables astronomers to study the growth history of galaxies by following their ‘merger trees’ in large-scale astrophysical simulations. The service uses the Myria parallel data management system as back-end and the D3 data visualization library within its graphical front-end. I will demonstrate this MyMergerTree service at the conference on a ∼5 TB dataset and discuss the future of such analyses at scale.
Authors: S. Loebman
Enhancing the Legacy of HST Spectroscopy in the Era of Big Data [6.07 MB PDF]
Alessandra Aloisi
In the era of large astronomical data, data-driven multi-wavelength science will play an increasing role in Astronomy over the next decade as surveys like WFIRST/AFTA and LSST become realities. The Mikulski Archive for Space Telescope (MAST), located at the Space Telescope Science Institute, is a NASA funded project to support and provide to the astronomical community a variety of astronomical data archives with the primary focus on scientifically related datasets in the optical, ultraviolet, and near-infrared parts of the spectrum. WIthin MAST a lot of attention has been devoted in the past to increase the discovery, improve the access, and create high-level science products for imaging data, while spectroscopic science products are presently very limited and primarily related to grism spectra. STScI has recently embarked in an effort to remedy this situation and to implement a number of possible enhancements to the Hubble Space Telescope archive that would make spectroscopic data more useful to the scientific community. Details of this effort will be given, including the development of algorithms for combining spectra, the definition of new high-level science products, the consolidation of existing visualization tools for spectra into the MAST portal, the implementation of discovery tools for spectroscopy, and the creation of tools for spectral feature identification and measurement. These enhancements will help the science community to tap the latent science potential of Hubble's archival spectroscopic data in many years to come.
Authors: Alessandra Aloisi
Wide-Field Radio Astronomy and the Dynamic Universe [4.74 MB PDF]
Bryan Gaensler
At radio wavelengths, tremendous signal-processing overheads have left the time-varying sky largely unexplored. However, a suite of new wide-field telescopes are now beginning to take data over a wide range of observing frequencies, time resolutions and cadences. These efforts will soon result in a comprehensive, deep, census of the variable and transient radio sky. In this talk, I will review the analysis techniques, algorithms and machine learning approaches that are allowing us to extract a new view of the time-variable sky from large and complex data sets, and will conclude by highlighting the unexpected behavior and newly identified source classes that these surveys have so far unveiled.
Authors: Bryan Gaensler
Probabilistic photometric redshift in the era of petascale astronomy [4.49 MB PDF]
Matias Carrasco Kind
Photometric redshifts (photo-z) are quickly becoming a critical measurement for large photometric galaxy surveys and there has been a significant development in this area in the last decade. Given the enormous amount of imaging data we are expecting to be available in the upcoming years, there is a need for fast, robust and more complex algorithms to compute and store not only a photo-z single estimate but also its probability density function.
In this talk I will review some of the state-of-the-art machine learning and data mining algorithms to compute, combine and store photo-z PDFs for large galaxy surveys. I will discuss how, a supervised machine learning technique , a unsupervised technique and a standard template fitting approach can be combined together in a Bayesian framework to improve the accuracy of the photo-z PDF as well as reducing the fraction of outliers in a completely new approach. I will also discuss how we can reduce the amount of storage of individual PDFs by using sparse representation techniques with a compression rate of 90% with an reconstruction accuracy of 99.8% using only 10 to 20 4-byte integers per galaxy which will allow us to store several techniques in a reduced amount of space which is becoming critical for current photometric surveys as several hundred million objects are expected. I will finalize by showing recent progress on how to implement and access these photo-z PDF from a SQL database using advance techniques allowing us for the first time to deal with these objects directly from the database and in fast and efficient manner with no precedents.
Authors: Matias Carrasco Kind + DESDM NCSA team
Case study: Classifying High Redshift Quasars on the LSST-Reprocessed SDSS Stripe 82 Imaging [6.13 MB PDF]
Yusra AlSayyad
To evaluate methods for extracting information from multi-epoch imaging, the LSST data management team is testing the prototype pipeline, which converts pixels into catalogs, on simulated images and existing time domain surveys. A 250 sq. deg. repeatedly-imaged stripe of the SDSS called Stripe 82 provides a strong test dataset because it was obtained under skies with highly variable conditions. Reprocessing the 20TB of imaging consisted of detecting sources on i-band co-adds but measuring flux on the single-epoch (ugriz) images, generating complete and less biased lightcurves for sources fainter than the single-epoch detection threshold. I will describe the technical challenges of aggregating these 16 billion photometry measurements into lightcurve metrics for each of the 40 million objects. I will also discuss our application of ensemble classifiers to this dataset to select z~4 quasars with care to maintain the ability to estimate the completeness of the sample when the underlying population has a complex distribution in feature space.
Authors: Yusra AlSayyad, Ian McGreer, Andy Connolly, Xiaohui Fan, Zeljko Ivezic, Andy Becker, Mario Juric, and the LSST Data Management team
Photometric classification of QSOs from RCS2 using Random Forest [2.92 MB PDF]
Felipe Barrientos
We describe the construction of a quasar catalog containing 91,842 candidates derived from analysis of imaging data with a Random Forest algorithm. Using spectroscopically confirmed stars and quasars from the SDSS as a training set, we blindly search the RCS-2 (750 deg^2) imaging survey. From a source catalogue of 1,863,970 RCS-2 point sources, our algorithm identifies putative quasars from broadband magnitudes (g, r, i, and z) and colours. By adding WISE and/or GALEX photometry we improve our candidate lists, obtaining a precision and recall of up to 99.3% and 99.2%, respectively. Spectroscopy for a small sample confirms our expectations.
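The classification scheme can be sketched with scikit-learn as follows. The feature set (broadband magnitudes plus colours) follows the abstract, but the file name, column names and train/test split are placeholders rather than the survey's actual setup.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Hypothetical training table: SDSS-confirmed stars (0) and quasars (1) with griz photometry.
train = pd.read_csv("sdss_training_set.csv")   # placeholder file name

# Broadband magnitudes plus adjacent colours, as described in the abstract.
for blue, red in [("g", "r"), ("r", "i"), ("i", "z")]:
    train[f"{blue}-{red}"] = train[blue] - train[red]
features = ["g", "r", "i", "z", "g-r", "r-i", "i-z"]

X_train, X_test, y_train, y_test = train_test_split(
    train[features], train["is_quasar"], test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))

# The trained forest would then be applied to the ~1.9 million RCS-2 point sources.
```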
Authors: F. Barrientos, D. Carrasco, T. Anguita, K. Pichara, et al.
Challenges in producing and using large spectroscopic and photometric redshift datasets [3.22 MB PDF]
Benjamin Weiner
Next generation surveys aim to measure extremely large samples of both spectroscopic and photometric redshifts. These will be used both to make precision cosmological tests and to study the properties and evolution of galaxies. Different use cases, for example measuring clustering vs. measuring a luminosity function, have substantially different requirements on the completeness and reliability of the redshifts, and on the supplementary information required to use them. The goals of the survey team and community users will often diverge and this should be considered when specifying deliverable data products. I will discuss the problem of classifying spectroscopic redshifts in deep, low-S/N surveys. For example, in the DEEP2 survey we had ~50,000 spectra yielding ~30,000 good redshifts, and were able to fit the redshifts automatically, but had to classify quality by eye. This contrasts with higher-S/N surveys like SDSS, and clearly does not scale to the next generation of very large surveys. Machine learning techniques may help address this problem. I will also discuss the calibration and use of photometric redshifts and the statistical techniques astronomers will need for measurements using photo-z distributions. In particular, I suggest that we need to move beyond simple treatments of photo-z errors. Furthermore, inference from photometric redshift catalogs will require more than simply a likelihood function P(z) for each galaxy, and this will be critical for LSST.
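One simple illustration of moving beyond point estimates is to propagate each galaxy's full P(z) into the redshift distribution of a sample rather than histogramming best-fit values. The sketch below, written under assumed array shapes and a placeholder input file, contrasts the two approaches; it is an illustration, not the speaker's analysis.

```python
import numpy as np

# Assumed inputs: a common redshift grid and one P(z) per galaxy, shape (Ngal, Nz).
z_grid = np.linspace(0.0, 3.0, 300)
pz = np.load("photoz_pdfs.npy")              # placeholder file name
pz /= pz.sum(axis=1, keepdims=True)          # normalize each PDF on the grid

# Naive approach: histogram of point estimates (here, the mode of each PDF).
z_best = z_grid[np.argmax(pz, axis=1)]
n_of_z_point, _ = np.histogram(z_best, bins=z_grid)

# PDF-based approach: sum the full probability densities ("stacking").
n_of_z_stacked = pz.sum(axis=0)

# The stacked estimate retains the uncertainty of poorly constrained galaxies,
# whereas the point-estimate histogram artificially sharpens the distribution.
```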
Authors: Benjamin Weiner
Characterizing the variable sky with CRTS [20.12 MB PDF]
Matthew Graham
The Catalina Real-time Transient Survey covers 80% of the sky to a magnitude limit of V~20 and has light curves for 500 million objects with typically 250 observations over a baseline of almost ten years. It is an unprecedented data set for understanding the temporal behaviour of many classes of astrophysical phenomena. It also serves as an excellent testbed on which to develop the tools and techniques necessary to work with future data intensive projects such as LSST and SKA. In this talk, I will review what CRTS is telling us about the variable sky and particularly how it is providing new insight into the physics of quasars.
Authors: Matthew Graham, S. G. Djorgovski, A. Drake, A. Mahabal, C. Donalek
Enabling Scalable Data Analytics for LSST and Beyond through Qserv [29.63 MB PDF]
Jacek Becla
The construction of the Large Synoptic Survey Telescope (LSST) is starting this year. Its large and rich data set will enable exploration of a wide range of astrophysical questions, ranging from discovering “killer” asteroids to examining the nature of dark energy. The exploration will involve running a wide range of spatial and temporal analyses, including computationally intensive scientific investigations in multi-dimensional space over the entire data set. Over a million analyses, from trivial to very complex, are expected every day, for decades. A specialized database system is being developed to make this not only possible, but also fast and efficient. The talk will describe the driving requirements, assumptions made, and design trade-offs that were considered in developing the baseline architecture for the LSST database. Details of its prototype implementation, Qserv, will be given, highlighting its features, status, and potential for usage outside LSST and even outside astronomy.
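To give a flavour of the workload such a system must serve, here is an illustrative spatial query issued from Python over a generic DB-API connection. The table name, columns and bounding-box approach are placeholders for the example; they are not Qserv's actual schema or query dialect, which uses its own spatial partitioning and functions.

```python
import sqlite3  # stand-in for whatever database driver the archive exposes

# Placeholder schema: an Object table (assumed to exist in the toy database)
# with ra/decl in degrees and an r-band magnitude.
QUERY = """
SELECT objectId, ra, decl, rMag
FROM   Object
WHERE  ra   BETWEEN :ra_min  AND :ra_max      -- crude bounding box; a production
AND    decl BETWEEN :dec_min AND :dec_max     -- system would exploit its partitioning
AND    rMag < :mag_limit
"""

def box_search(conn, ra0, dec0, half_width_deg, mag_limit):
    """Run a simple spatial box plus magnitude cut around (ra0, dec0)."""
    params = {
        "ra_min": ra0 - half_width_deg, "ra_max": ra0 + half_width_deg,
        "dec_min": dec0 - half_width_deg, "dec_max": dec0 + half_width_deg,
        "mag_limit": mag_limit,
    }
    return conn.execute(QUERY, params).fetchall()

conn = sqlite3.connect("toy_catalog.db")       # placeholder local database
rows = box_search(conn, ra0=150.1, dec0=2.2, half_width_deg=0.5, mag_limit=22.0)
print(len(rows), "objects returned")
```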
Authors: Jacek Becla
Will Your Hypercubes Eat My Asterozoa? Building a Versatile Data Archive for the SAMI IFU Galaxy Survey. [6.54 MB PDF]
Iraklis Konstantopoulos
It turns out that data are just data. So, with only a little effort, your 5D hypercubes will be safe in the same tank as those version-controlled tables. This is the assurance I gave the SAMI team when delivering the original specification for the data archive, albeit slightly paraphrased. The SAMI Galaxy Survey will ultimately collect spatially resolved spectroscopy of ≈3400 galaxies. While managing the volume of data is not in itself a tall order, the nature of the information stored presents an opportunity for astronomers to set up creative information access methods. Since the primary product is a datacube, that is, a matrix of spectra, one should be free to slice and dice it at will, and to do so directly on the data archive, rather than being restricted by having to download thousands of individual data files. This is what we have set up for SAMI: a data archive, query engine, and database all rolled into one, built on the HDF5 file format through a set of Python codes. I will present an overview of this open source software, which we hope will be of use to other scientists who seek a simple and versatile data access mechanism. I will also review the use case for the Starfish diagram, which was developed to succinctly and simultaneously convey a wide range of information about an individual galaxy (database entry) and about the sample in its entirety.
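As an illustration of the kind of "slice and dice" access an HDF5-backed archive permits, the sketch below pulls a single spaxel spectrum and a narrow-band image out of a datacube with h5py. The group and dataset names, cube orientation and wavelength window are invented for the example and do not reflect the actual SAMI archive layout.

```python
import h5py
import numpy as np

# Hypothetical archive layout: one group per galaxy, cube stored as (wavelength, y, x).
with h5py.File("sami_archive.h5", "r") as archive:
    cube = archive["galaxies/12345/blue_cube"]        # h5py Dataset; nothing loaded yet
    wavelength = archive["galaxies/12345/wavelength"][:]

    # One spaxel's spectrum: a 1D slice read directly from disk.
    spectrum = cube[:, 25, 25]

    # A pseudo narrow-band image: collapse a wavelength window over the spatial plane.
    lo, hi = np.searchsorted(wavelength, [4950.0, 5050.0])
    narrow_band = cube[lo:hi, :, :].sum(axis=0)

print(spectrum.shape, narrow_band.shape)
```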
Authors: Iraklis Konstantopoulos
The StArchive: An Open Stellar Archive [16.19 MB PDF]
Angelle Tanner
Historically, astronomers have utilized a piecemeal set of archives such as SIMBAD, the Washington Double Star Catalog, various exoplanet encyclopedias and electronic tables from the literature to cobble together stellar and planetary parameters, without the corresponding images and spectra. The mothballed NStED archive was in the process of collecting such data on nearby stars, but if it is revived its focus may shift to NASA mission-specific targets rather than a volume-limited sample of nearby stars. This leaves a void in the available set of tools: many exoplanet astronomers would appreciate being able to create comprehensive lists of the stellar parameters of stars in our local neighborhood. We also need better resources for downloading adaptive optics images and published spectra to help confirm new discoveries and find ideal target stars. With so much data being produced by the stellar and exoplanet community, we propose the creation of the StArchive, an open-access stellar archive in the spirit of the Open Exoplanet Catalogue, the Kepler Community Follow-up Program and many other existing archives. While we will carefully regulate and constantly validate the data being placed into the archive, the open nature of its design is intended to allow the database to be updated quickly and to have the level of versatility that is necessary in today's fast-moving, big-data exoplanet community. Here, I will introduce the community to the content and expected capabilities of the archive and solicit feedback from the audience.
Authors: Tanner, A. et al.
Reducing Large Imaging Data Sets on the Amazon Elastic Compute Cloud [16.68 MB PDF]
Ben Williams
The PHAT survey's 6-band UV-IR photometry of over 100 million stars was performed on the Amazon Elastic Compute Cloud (EC2). Using this resource for parallel reduction of such a large data set required overcoming several technical obstacles regarding security, file sharing, network communication, error trapping, and more. I will discuss how these issues were overcome in order to produce an efficient automated pipeline on EC2, as well as the advantages of using this resource compared to a dedicated compute cluster.
Authors: Ben Williams
Creating A Multiwavelength Galactic Plane Atlas With Amazon Web Services [17.3 MB PDF]
Bruce Berriman
We describe by example how astronomers can optimize cloud-computing resources offered by Amazon Web Services (AWS) to create, curate and serve new datasets at scale. We have produced an atlas of the Galactic Plane at 16 wavelengths from 1 micron to 24 microns with a spatial sampling of 1 arcsec. The atlas has been created using the Montage mosaic engine to generate co-registered mosaics of images released by the major surveys WISE, 2MASS, MSX, GLIMPSE and MIPSGAL. The full atlas is 59 TB in size, and composed of over 9,600 5 degree x 5 degree tiles with one degree overlap between them. The dataset is housed on the Amazon S3 storage platform, designed for at-scale storage with access via web protocols. It is accessible through a prototype web-form and API that will support access to the data according to the users’ query specifications. When the interface is complete, the data set will be made public.
The processing required 340,000 compute hours for completion, carried out on virtual clusters created and managed on AWS platforms through the Pegasus workflow management system. We will describe the optimization methods, compute time and processing costs as a guide for others wishing to take advantage of cloud platforms for processing and data creation. Our goal is to use the expertise developed here to build a set of “turnkey” tools, based on open source technologies, that will enable scientists to perform processing at scale on cloud platforms with minimal system-management knowledge. This is part of a major upgrade in the capabilities of Montage, which will include support for image cubes and for data organized in the HEALPix partitioning scheme.
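The tiling geometry described above (5° x 5° tiles with 1° of overlap between neighbours) can be sketched as follows. The Galactic latitude range and the tile naming are assumptions made purely for illustration and are not the atlas's actual specification.

```python
# Generate centres for overlapping 5x5 degree tiles along the Galactic plane.
TILE_SIZE = 5.0              # degrees on a side
OVERLAP = 1.0                # degrees shared between neighbouring tiles
STEP = TILE_SIZE - OVERLAP   # 4 degrees between adjacent tile centres

LON_RANGE = (0.0, 360.0)     # full Galactic longitude coverage
LAT_RANGE = (-10.0, 10.0)    # assumed latitude coverage (illustrative only)

def tile_centres():
    lon = LON_RANGE[0] + TILE_SIZE / 2.0
    while lon - TILE_SIZE / 2.0 < LON_RANGE[1]:
        lat = LAT_RANGE[0] + TILE_SIZE / 2.0
        while lat - TILE_SIZE / 2.0 < LAT_RANGE[1]:
            yield round(lon % 360.0, 2), round(lat, 2)
            lat += STEP
        lon += STEP

tiles = list(tile_centres())
print(f"{len(tiles)} tile centres generated under these assumptions")
for glon, glat in tiles[:3]:
    # Each centre would seed a Montage mosaic request for every band of the atlas.
    print(f"tile_G{glon:06.2f}{glat:+06.2f}")
```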
Authors: G. B. Berriman, J. C. Good, M. Rynge, E. Deelman, G. Juve, J. Kinney, A. Merrihew
Trident: Custom Scalable Scientific Compute Archive and Analysis [6.56 MB PDF]
Arvind Gopu
As imaging systems improve, the size of astronomical data sets continues to grow, making the transfer, secure storage, processing, and analysis of data a significant burden. To solve this problem for the WIYN Observatory One Degree Imager (ODI), we developed the ODI Pipeline, Portal, and Archive (ODI-PPA), a completely web-based solution that provides astronomers with a modern user interface, acts as a single point of access to their data, and offers rich computational and visualization capabilities. It supports scientists in handling complex data sets while enhancing WIYN's scientific productivity beyond data acquisition.
ODI-PPA is designed as a Scalable Compute Archive (SCA) with several built-in frameworks: (1) Collections, which allow an astronomer to create logical collations of data products intended for publication, further research, instructional purposes, or for executing data processing tasks; (2) Image Explorer and Source Explorer, which together enable real-time interactive visual analysis of massive astronomical data products within an HTML5-capable web browser, with overlaid standard-catalog and Source Extractor-generated source markers; and (3) a Workflow framework, which enables rapid integration of data processing pipelines on an associated compute cluster and lets users request that such pipelines be executed on their data via custom user interfaces.
Once ODI-PPA became operational, we identified other research groups and centers that may benefit from having an SCA system for their users and data. We prototyped a PPA-like system for the Ludwig Maximilian University's Wide Field Imager, and then adapted our code to develop a powerful analysis and visualization portal for Globus Cluster System (GCS) data collected by IU researchers over more than a decade. Thus far, we have also adapted our code to the IU Electron Microscopy Center's data (EMC-PPA).
Our umbrella project, codenamed Trident, encompasses ODI-PPA and its offshoots EMC-PPA, LMU-PPA, and the GCS portal. It is made up of several lightweight services connected by a message bus: a web portal built using the Twitter Bootstrap, AngularJS and jQuery JavaScript libraries, and backend services written in PHP (using the Zend framework) and Python; it leverages supercomputing and storage resources at Indiana University. With data volumes and computational needs increasing steadily, and perhaps exponentially in some cases such as LSST, we believe Trident-based SCA systems can be reconfigured for use in any science domain with large and complex datasets and a need for cloud- or grid-based computational and storage resources.
Authors: Gopu, Hayashi, Young, Henschel, Kotulla, et al.
Handling the Dark Energy Survey [7.59 MB PDF]
Ignacio Sevilla
The Dark Energy Survey is well into the processing of its second year of survey data. In the meantime, scientists have been working through the Science Verification and Year 1 datasets to produce the survey's first scientific results. In this contribution, we summarize the challenges and the current solutions being used by the collaboration, with emphasis on tools for data analysis and visualization, as well as on how groups have effectively coordinated to deal with the large amount of data coming out of the project.
Authors: Ignacio Sevilla for the Dark Energy Survey Collaboration
Efficient data reduction and analysis of DECam images in a multicore server [5.66 MB PDF]
Roberto Munoz
We have embarked on a panchromatic survey of the Fornax galaxy cluster using the DECam instrument at the Blanco telescope. The survey consists of ugriz-band photometry of nine contiguous tiles covering the central Fornax cluster region, a total area of 27 deg^2. The first challenge we faced was not having enough disk space to store the raw and processed images, together with slow access to the data when using single drives. We solved both problems with a commercial RAID system consisting of 18 HDD drives, which offers a storage capacity of 38 TB and delivers a transfer rate of 1,200 MB/s. The second challenge was developing efficient algorithms for registering and stacking hundreds of images on a single multicore server. We solved this with a combination of the Astromatic software packages and a set of IDL and Python routines. Last but not least, I will discuss several advanced methods for modeling and subtracting the sky in DECam surveys that involve large galaxies in high-density environments, and will present a catalog of new dwarf galaxies in Fornax that were discovered using a machine learning algorithm.
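A stripped-down version of the parallelization pattern (not the actual survey pipeline) might look like the following, where each tile/band combination is reduced in its own worker process on the multicore server. The `process_tile` body is a placeholder for the Astromatic-based registration and stacking steps.

```python
import multiprocessing as mp
import subprocess

BANDS = ["u", "g", "r", "i", "z"]
TILES = [f"tile{n:02d}" for n in range(1, 10)]   # nine contiguous Fornax tiles

def process_tile(job):
    """Placeholder worker: register and stack all exposures of one tile in one band.

    In the real pipeline this step would call the Astromatic tools plus the
    custom IDL/Python routines; here we only echo the job as a stand-in.
    """
    tile, band = job
    subprocess.run(["echo", f"reducing {tile} {band}"], check=True)
    return tile, band

if __name__ == "__main__":
    jobs = [(tile, band) for tile in TILES for band in BANDS]
    # One worker per core: tile/band combinations are independent, so they
    # parallelize trivially across the server's cores.
    with mp.Pool(processes=mp.cpu_count()) as pool:
        for tile, band in pool.imap_unordered(process_tile, jobs):
            print(f"finished {tile} {band}")
```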
Authors: Roberto P. Munoz, T. H. Puzia, M. Taylor, P. Eigenthaler