Catherine Zucker is a fourth-year PhD student in astronomy at Harvard. Her work focuses on understanding the structure of our Milky Way Galaxy through a combination of observations, numerical simulations, statistics, and data visualization. She is advised by Professors Alyssa Goodman and Douglas Finkbeiner. Alyssa Goodman is the Robert Wheeler Willson Professor of Applied Astronomy, a co-founder of the Initiative in Innovative Computing, and a member of the Harvard Data Science Initiative steering committee; her research spans astronomy, data visualization, and online systems for research and education. Douglas Finkbeiner holds a joint professorship in the Departments of Astronomy and Physics; his research spans high-energy astrophysics, dark-matter annihilation, Galactic structure, interstellar dust, and large photometric surveys.
Managing “Big” Data:
As a researcher trying to understand the structure of the Milky Way, I often deal with very large astronomical datasets (terabytes of data, representing almost two billion unique stars). Every dataset we use is publicly available to anyone; the primary challenge in processing them is simply their size. Most astronomical data hosting sites offer remote querying through a web interface, but that route is too slow and inefficient for our science: we run a huge number of searches over different areas of the sky, with advanced SQL-like queries, and we draw on data from many telescopes rather than just one.
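To give a flavor of the kind of positional search we repeat many thousands of times, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for the survey databases. The table name, column names, and values are all hypothetical, and a real query would use proper spherical geometry rather than a simple coordinate box.

```python
# Toy stand-in for one of our many SQL-like positional queries.
# Table/column names and values are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stars (ra REAL, dec REAL, g_mag REAL)")
conn.executemany(
    "INSERT INTO stars VALUES (?, ?, ?)",
    [(10.2, 41.1, 18.3), (10.4, 41.3, 19.7), (85.0, -2.0, 17.1)],
)

def box_search(conn, ra0, dec0, half_width):
    """Crude 'box search' around one patch of sky (degrees)."""
    cur = conn.execute(
        "SELECT ra, dec, g_mag FROM stars "
        "WHERE ra BETWEEN ? AND ? AND dec BETWEEN ? AND ?",
        (ra0 - half_width, ra0 + half_width,
         dec0 - half_width, dec0 + half_width),
    )
    return cur.fetchall()

matches = box_search(conn, 10.3, 41.2, 0.5)
print(len(matches))  # the two stars near (10.3, 41.2)
```

Running such a search once through a web form is easy; running it over thousands of sky patches, across multiple surveys, is what makes a local mirror necessary.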
To circumvent this issue, we download all the catalogs locally to Harvard Odyssey, with each independent survey housed in a separate database. We use a Python-based tool (the “Large Survey Database”) developed by a former postdoctoral scholar at Harvard, which allows us to perform fast queries of these databases simultaneously on the Odyssey computing cluster. We break the sky down into pixels and query each one independently using the Large Survey Database; the curated data products for each pixel are then stored as HDF5 files, a structured file format that allows for easy management and processing of large, heterogeneous datasets.
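Conceptually, the pixelation step just assigns every star to a patch of sky so the patches can be processed independently. Our pipeline uses proper HEALPix pixels; the simple rectangular binning below is only an illustrative stand-in, with toy coordinates.

```python
# Simplified stand-in for the sky-pixelation step: assign each star to a
# coarse (ra, dec) pixel so each pixel can be queried and processed
# independently. The real pipeline uses HEALPix pixels, not this
# rectangular binning.
from collections import defaultdict

def pixel_index(ra, dec, pix_size_deg=5.0):
    """Map sky coordinates (degrees) to a coarse pixel (i, j)."""
    return (int(ra // pix_size_deg), int((dec + 90.0) // pix_size_deg))

# Group a toy catalog by pixel; in practice, each group would become
# one HDF5 file of curated data products.
catalog = [(10.2, 41.1), (10.4, 41.3), (85.0, -2.0)]
pixels = defaultdict(list)
for ra, dec in catalog:
    pixels[pixel_index(ra, dec)].append((ra, dec))

print(len(pixels))  # two occupied pixels for this toy catalog
```

Because each pixel's file is self-contained, the downstream fitting can run on thousands of pixels in parallel across the cluster.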
To extract information from each HDF5 file, we have developed a sophisticated Bayesian analysis pipeline that reads in our curated files and outputs best fits for our model parameters (in our case, distances to local star-forming regions near the Sun). Development of the pipeline was led by Joshua Speagle, a graduate student and co-PI on the paper; the Python codebase is publicly available on GitHub with full API documentation, and in the future it will be archived with a permanent DOI on Zenodo. The GitHub repository also includes full working examples demonstrating how to read in the publicly available data and reproduce the style of figures seen in the paper. Sample data are provided, and the demo is configured as a Jupyter notebook, so interested users can walk through the methodology line by line.
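The core idea behind the distance inference can be sketched in a few lines: a dust cloud at some distance causes stellar reddening to jump from near zero (foreground stars) to some higher value (background stars), so scanning candidate cloud distances against per-star reddenings yields a posterior over distance. The toy grid search below uses invented numbers and a bare-bones Gaussian likelihood; the actual pipeline is far more sophisticated (per-star distance posteriors, nested sampling, outlier modeling).

```python
# Toy, grid-based version of the cloud-distance inference. All numbers
# are invented for illustration; this is not the paper's method.
stars = [  # (star distance in pc, observed reddening in mag)
    (150, 0.05), (250, 0.02), (350, 0.98), (450, 1.03), (550, 0.97),
]
sigma = 0.1  # assumed reddening uncertainty (mag)

def log_like(d_cloud):
    """Log-likelihood of a step-function reddening profile at d_cloud."""
    total = 0.0
    for dist, red in stars:
        model = 1.0 if dist > d_cloud else 0.0  # jump behind the cloud
        total += -0.5 * ((red - model) / sigma) ** 2
    return total

grid = range(100, 600, 10)
best = max(grid, key=log_like)
print(best)  # any distance between the 2nd and 3rd star fits equally well
```

Even this toy version shows why the method works: the likelihood is sharply penalized whenever the assumed cloud distance misclassifies a star as foreground or background.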
Archiving the Data:
Our statistical pipeline outputs information on over one thousand lines of sight, distributed across different areas of the sky. Publishing dozens of data tables in the paper itself is not feasible, nor would it be useful to scientists, since printed tables are not machine-readable. In the same vein, not all of the parameters we fit would interest the average reader: many are “fudge factors,” relevant only to the small fraction of the most dedicated readers curious about our statistical methodology. We have therefore included a sample table in the paper, with a reduced set of model parameters, and made everything else available via the Harvard Dataverse, which provides machine-readable catalogs of all our data products. These data products are hyperlinked to the paper via permanent DOIs, avoiding the “link rot” that afflicts many datasets in the current publishing era.
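Machine-readability means a reader can load the catalogs directly into their own analysis. The sketch below parses a CSV-style table with only the Python standard library; the column names and values here are hypothetical, and the actual Dataverse files document their own columns.

```python
# Sketch of loading a machine-readable catalog. Column names and values
# are hypothetical stand-ins for the real Dataverse files.
import csv
import io

catalog_text = """name,l_deg,b_deg,distance_pc,distance_err_pc
CloudA,110.0,17.0,350,25
CloudB,159.0,-20.0,300,20
"""

rows = list(csv.DictReader(io.StringIO(catalog_text)))
for row in rows:
    print(row["name"], float(row["distance_pc"]))
```

Because every field is a named, typed column rather than typeset text, the catalog can feed straight into scripts, plots, or follow-up fits.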
Telling a Story:
One of the goals of our study was to tell a story and to let researchers interact with our data via interactive figures. Currently, fewer than 1% of articles submitted to the main set of astronomical journals (AAS Publishing) include interactive figures, even though the journals support them. With the help of the bokeh Python package, we made the majority of the figures in our paper interactive, and they will appear that way in the HTML version of the online journal. For now, the PDF of the paper links to HTML files hosted on an Odyssey computing node, but in the future we hope to host the interactive files on the Dataverse as well! You can check out one of the interactive figures below, showing how the gas and dust in a famous local star-forming region (the “Cepheus” cloud) is distributed along the line of sight. Each pixel is colored by its inferred distance, and you can hover over any pixel to see how far away it is!
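For readers curious how such hover interactivity is built, here is a minimal bokeh sketch in the same spirit, assuming bokeh is installed. The coordinates and distances are toy values, and this is not the paper's actual figure code.

```python
# Minimal hover-to-see-distance sketch with bokeh. Toy values only;
# not the paper's actual figure code.
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

source = ColumnDataSource(data=dict(
    x=[0, 1, 2],               # pixel sky coordinates (toy values)
    y=[0, 0, 0],
    distance=[350, 360, 920],  # inferred distance per pixel (pc)
))

p = figure(title="Toy sightline map: hover to see distance")
p.scatter("x", "y", size=20, source=source)

# Hovering over a pixel reports its inferred distance.
p.add_tools(HoverTool(tooltips=[("distance (pc)", "@distance")]))

# To write a standalone HTML file (like those linked from our PDF):
# from bokeh.io import output_file, save
# output_file("toy_map.html"); save(p)
```

The key design point is that the figure is just an HTML/JavaScript document, so it can be hosted anywhere a static file can.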
Caption: A static version of our interactive figure as it will appear in the article’s PDF format. The actual interactive figure will be available in the HTML version of the article and is also hosted on an Odyssey computing node here. The left panel shows the distribution of dust and gas in the nearby “Cepheus” star-forming cloud; we pixelate the cloud and fit distances to each sightline. Our results are shown in the right panel, with each sightline colored by its inferred distance. The two different colors (purple and green) indicate that this is actually two clouds, at two different distances along the line of sight!
For the Future:
In the future, we hope to implement even more of the practices put forth in the “Paper of the Future,” including video abstracts, “3D” PDFs, and tools like the WorldWide Telescope and glue for putting images in context. For now, we hope this is one small step toward the future of scientific publishing!
About the paper:
Zucker, C. and Speagle, J. (co-PIs), Schlafly, E., Green, G., Finkbeiner, D., Goodman, A., and Alves, J. 2019. “A Large Catalog of Accurate Distances to Local Molecular Clouds: The Gaia DR2 Edition.” Submitted to The Astrophysical Journal. Available as an arXiv preprint here.