Dr. Viviana Acquaviva

Welcome to the TerraD2I (Data to Insights) Lab!
We work at the interface of models and data
and use statistical and machine learning tools
to study different aspects of the Earth System.
Read about our current projects below
and consider joining us!


CARBONara

This project aims to reconstruct the global surface ocean pCO2 field, starting from observations that are extremely sparse in space and time. Because of this sparsity, the reconstruction of the full field relies on additional information that can be measured from satellites, such as the temperature and salinity of the ocean. These become the features of a machine learning model that is trained to predict pCO2 using the available observations as a training set. The predictions of the ML model are then used for "infilling", or reconstructing, the full pCO2 field, which in turn serves to estimate the global ocean carbon sink. This is a naturally difficult problem for ML methods, because there is an unavoidable distribution shift between the training domain (where observations are available) and the application domain (all other points in space and time). The project's objective is to improve this reconstruction, making it more accurate and robust.
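As a rough illustration of the infilling idea (not the lab's actual pipeline), here is a minimal sketch: a model is trained only on grid cells with pCO2 observations, then applied everywhere. The feature names and the choice of a random forest are assumptions for the example.

```python
# Minimal sketch of pCO2 "infilling": train on sparse observed points, then
# predict on every grid cell. Feature names and model choice are illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# df holds one row per (lat, lon, month) grid cell; driver variables are
# available everywhere, while pCO2 observations exist only at a sparse subset.
FEATURES = ["sst", "sss", "chl", "mld", "lat", "lon", "month"]  # hypothetical columns

def reconstruct_pco2(df: pd.DataFrame) -> np.ndarray:
    observed = df["pco2_obs"].notna()                 # sparse training mask
    model = RandomForestRegressor(n_estimators=300, random_state=0)
    model.fit(df.loc[observed, FEATURES], df.loc[observed, "pco2_obs"])
    # Predict on all grid cells, observed or not: the "infilled" field.
    return model.predict(df[FEATURES])
```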

Collaborators: Galen McKinley's group at Columbia University and LDEO, including Amanda Fay, Thea Heimdal, and Abby Shaum; Romina Wild (OGS, Trieste); Alessandro Laio (SISSA).
Metrics Reloaded

In this project, led by postdoctoral researcher Gabriele Accarino, we are developing new metrics to capture the complex multi-scale behavior of climate fields, such as temperature and precipitation, with the goal of improving model evaluation and boosting models' forecasting skill. Our new metric, WaveSim, goes beyond point-wise metrics by using a wavelet decomposition framework to attribute differences between maps to physical scales, and can point to magnitude, displacement, and/or structure differences as the underlying cause.
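To illustrate the general idea of attributing map differences to physical scales (this is not WaveSim itself, whose definition is not reproduced here), a simple scale-wise comparison of two fields can be built from a 2D wavelet decomposition, e.g. with PyWavelets:

```python
# Illustrative sketch only: compare two climate fields scale by scale using a
# 2D wavelet decomposition (PyWavelets). Not the actual WaveSim metric.
import numpy as np
import pywt

def scalewise_mse(field_a: np.ndarray, field_b: np.ndarray,
                  wavelet: str = "haar", level: int = 4) -> dict:
    """Mean squared difference of wavelet detail coefficients, per scale."""
    ca = pywt.wavedec2(field_a, wavelet, level=level)
    cb = pywt.wavedec2(field_b, wavelet, level=level)
    scores = {}
    for lev, (da, db) in enumerate(zip(ca[1:], cb[1:]), start=1):
        # da, db are (horizontal, vertical, diagonal) detail coefficient arrays
        scores[f"scale_{lev}"] = float(
            np.mean([np.mean((a - b) ** 2) for a, b in zip(da, db)])
        )
    return scores
```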

Collaborators: Gabriele Accarino (Columbia), Sara Shamekh (NYU), Duncan Watson-Parris (UCSD), and Dave Lawrence (NCAR).
The material below is from my pre-pivot work in Astrophysics.

Figure: Feature ranking of galaxies divided by their bulge-to-stellar mass ratio.
In early 2025, my possibly final Astro paper with Festa Buçinca and Ari Maller came out! We tested a pipeline to discover hypotheses for physical models. We looked at whether we can recover a model for the origin of galaxy sizes in a semi-analytic model of galaxy formation, using feature ranking techniques to figure out which variables are important, and symbolic regression to generate our hypotheses. Spoiler alert: it's really hard!
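A minimal sketch of the two-stage pipeline described above (ranking first, symbolic regression second), with illustrative column names and model choices that are not those of the paper:

```python
# Sketch of the hypothesis-discovery pipeline: rank features with permutation
# importance, then feed the top ones to a symbolic regression tool.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def rank_features(X: np.ndarray, y: np.ndarray, names: list) -> list:
    """Return (feature name, importance) pairs, most important first."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
    imp = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
    order = np.argsort(imp.importances_mean)[::-1]
    return [(names[i], imp.importances_mean[i]) for i in order]

# The highest-ranked variables would then go to a symbolic regression package
# (e.g., PySR) to generate candidate analytic expressions for galaxy size.
```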

As a spin-off from the project described below, I have been working with Chris Lovell, and more recently Kartheik Iyer and a few other folks at the KITP Workshop on Galaxy Evolution x ML, on learning effective physical representations for galaxy spectra. We explored two ideas: (1) use clustering in feature space to reduce correlation and dimensionality; (2) create a "superset" of features that contains fluxes and gradients on different scales. Of course, the two can also be combined. We found that this works quite well: we can determine galaxy physical parameters (e.g., stellar mass, star formation rate, median stellar age) with minimal information loss (a 0-20% increase in MSE when going from 2000 to 10-100 features). Examples of the clusters we found can be seen on the right (affectionately dubbed "Re-discovering Photometry").
Image credit: Chris Lovell
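A minimal sketch of idea (1), clustering correlated spectral bins and keeping one summary value per cluster; the numbers of features and clusters are illustrative assumptions, not the actual pipeline:

```python
# Compress a spectrum by clustering correlated wavelength bins ("features")
# and averaging within each cluster. Numbers here are illustrative.
import numpy as np
from sklearn.cluster import FeatureAgglomeration

def compress_spectra(spectra: np.ndarray, n_clusters: int = 50) -> np.ndarray:
    """Reduce ~2000 spectral bins to n_clusters averaged 'super-features'.

    spectra: array of shape (n_galaxies, n_bins); returns (n_galaxies, n_clusters).
    """
    agglo = FeatureAgglomeration(n_clusters=n_clusters)
    return agglo.fit_transform(spectra)

# One would then train a regressor for, say, stellar mass on the compressed
# representation and compare its MSE to a model trained on the full spectrum.
```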
A long-term (most likely, infinite-term) project with Chris Lovell and Emille Ishida aims to predict the generalization error in predicting galaxy physical properties (stellar mass, SFHs...) when using machine learning models trained on different simulations. Our hypothesis is that we can build a reasonably tight regression relation between the generalization error and a suitable distance metric in the (observed) space of simulated spectra. The final goal is to predict the generalization error on data, even without training labels. We test this idea by training models to predict stellar mass on 20 sets of simulations and applying them to each of the other 19. We show one example plot where the "target" set of spectra is simulation 1, plotting the MSE obtained when we apply the 20 learned functions against our chosen distance metric, the Euclidean distance on the 100 most significant features. There is a clear trend that suggests the possibility of fitting the regression successfully and predicting the generalization error on data (where only the distance is known). From our recent NeurIPS paper.
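The core idea can be sketched as a simple regression of error against distance; the distance definition below (Euclidean distance between mean feature vectors) is an assumption for the example and need not match the paper's exact choice:

```python
# Sketch: relate cross-simulation generalization error (MSE) to a distance in
# feature space, then use the fitted relation to forecast the error on data
# where no labels exist. Details are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

def euclidean_feature_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Distance between the mean feature vectors of two sets of spectra."""
    return float(np.linalg.norm(feats_a.mean(axis=0) - feats_b.mean(axis=0)))

def fit_error_vs_distance(distances: np.ndarray, mses: np.ndarray) -> LinearRegression:
    """distances, mses: one entry per (training sim, target sim) pair."""
    return LinearRegression().fit(distances.reshape(-1, 1), mses)

# To forecast performance on real data: compute its distance to the training
# simulation and read the expected MSE off the fitted relation.
```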

In another work, in collaboration with Andy Lawler, we used Bayesian model selection to understand which star formation history models are favored by observations. We begin by asking: can we infer *from data* how many major episodes of star formation a galaxy underwent, using the ratio of evidences (Bayes factor) between nested models of varying complexity? We answer: most often yes, but you need exquisite photometry. In the figure, we show how the Bayes factor (ratio of the evidence of the "correct" model versus a simpler model) varies as a result of changing the S/N, from current to planned surveys. The ideal case is the one in which all Bayes factors are > 1, and we see that the percentage of cases satisfying this increases significantly as the S/N increases. We also ask: is the Savage-Dickey Density Ratio, a procedure to compute Bayes factors in nested models, a valid substitute for the evidence-based calculation? We answer: yes, but only if you have good sampling in the tails. This is relevant because the SDDR can be derived from "regular" (not computationally expensive) MCMC chains, unlike the evidence!
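For readers unfamiliar with the SDDR, here is a toy sketch of the standard construction: for nested models where the extra parameter is fixed to some value in the simpler model, the Bayes factor in favor of the simpler model is the posterior density at that value divided by the prior density there. The KDE-based estimate is illustrative and, as noted above, only reliable if the chain samples that region well.

```python
# Toy Savage-Dickey density ratio for a parameter with a uniform prior.
import numpy as np
from scipy.stats import gaussian_kde, uniform

def sddr_bayes_factor(posterior_samples: np.ndarray, theta0: float,
                      prior_lo: float, prior_hi: float) -> float:
    """BF(simple / complex): posterior density at theta0 over prior density at theta0."""
    posterior_density = gaussian_kde(posterior_samples)(theta0)[0]
    prior_density = uniform.pdf(theta0, loc=prior_lo, scale=prior_hi - prior_lo)
    return posterior_density / prior_density
```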


In 2019, Chris Lovell, collaborators, and I published a paper in which we trained machine learning and deep learning (CNN) methods to understand the relationship between galaxy spectra and their star formation histories, using Illustris and EAGLE to train the models and Flexible Stellar Population Synthesis models to model the galaxies. We found great things:

  • Matching the SDSS coverage and S/N for massive galaxies (log(M/MSun) > 10), we can constrain the SFH of a galaxy in 8 lookback-time bins with an average accuracy of 10-20% in each bin (see figure for a set of six cases, from best to worst).
  • The numbers don't blow up even if we train on one simulation and test on the other.
However, we still felt that we needed to figure out what the generalization properties of these models would be on data, which prompted our follow-up project (above).
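For concreteness, here is a minimal sketch of the kind of network involved; the architecture, layer sizes, and PyTorch implementation are illustrative assumptions and not the one used in the paper:

```python
# Illustrative 1D CNN mapping a galaxy spectrum to the star formation history
# in 8 lookback-time bins.
import torch
import torch.nn as nn

class SFHNet(nn.Module):
    def __init__(self, n_pixels: int = 2000, n_bins: int = 8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * (n_pixels // 16), 64), nn.ReLU(),
            nn.Linear(64, n_bins),               # one output per lookback-time bin
        )

    def forward(self, spectrum: torch.Tensor) -> torch.Tensor:
        # spectrum: (batch, 1, n_pixels)
        return self.head(self.conv(spectrum))
```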


Figure: Pretty colors trace likelihood in this plot showing galaxy properties, from VA, Vargas, Gawiser and Guaita 2012.
Digging a bit further into the past: understanding the physical properties of galaxies, such as stellar mass, star formation histories, dust content, redshift, and metallicity, through Spectral Energy Distribution (SED) fitting has been a major focus of my research for several years. I wrote two Markov Chain Monte Carlo codes for SED fitting. I still use them, but they now have tons of competitors (and that's a very good thing!); see the GalMC page.
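In the spirit of GalMC (but not its actual code), a bare-bones MCMC SED fit looks like this; the emcee sampler is real, while model_fluxes, the parameters, and the filter set are toy stand-ins for a proper stellar population synthesis call:

```python
# Bare-bones MCMC SED-fitting sketch with emcee; the "SED model" is a toy.
import numpy as np
import emcee

BANDS = np.linspace(0.4, 2.2, 8)                # hypothetical filter wavelengths (microns)

def model_fluxes(theta: np.ndarray) -> np.ndarray:
    """Toy stand-in for a stellar population synthesis prediction."""
    mass, age, dust = theta                     # hypothetical parameters
    return mass * BANDS ** (-age) * np.exp(-dust * BANDS)

def log_posterior(theta, obs_flux, obs_err):
    if np.any(theta < 0):                       # crude flat prior with bounds
        return -np.inf
    resid = (obs_flux - model_fluxes(theta)) / obs_err
    return -0.5 * np.sum(resid ** 2)            # Gaussian log-likelihood

def run_fit(obs_flux, obs_err, ndim=3, nwalkers=32, nsteps=5000):
    p0 = np.abs(np.random.randn(nwalkers, ndim)) + 1e-3
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior,
                                    args=(obs_flux, obs_err))
    sampler.run_mcmc(p0, nsteps)
    return sampler.get_chain(discard=1000, flat=True)
```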

Don’t have data yet? If you are planning a survey and would like to know how well you can constrain the physical properties of galaxies with your observations (or want to try out a few different ones and see which one works best), you are welcome to use GalFish.
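The forecasting idea behind such a tool can be sketched with a standard Fisher-matrix calculation (this is not GalFish's actual code): given a fiducial model, photometric uncertainties, and numerical derivatives of the model fluxes, one estimates how well each parameter could be constrained.

```python
# Generic Fisher-matrix forecast for a Gaussian likelihood with independent
# photometric errors; model_fluxes(theta) -> predicted fluxes is supplied by the user.
import numpy as np

def fisher_forecast(model_fluxes, theta_fid: np.ndarray, sigma: np.ndarray,
                    step: float = 1e-4) -> np.ndarray:
    """Return forecast 1-sigma uncertainties on each parameter."""
    ndim = len(theta_fid)
    derivs = []
    for i in range(ndim):
        dtheta = np.zeros(ndim)
        dtheta[i] = step
        derivs.append((model_fluxes(theta_fid + dtheta) -
                       model_fluxes(theta_fid - dtheta)) / (2 * step))
    F = np.array([[np.sum(derivs[i] * derivs[j] / sigma ** 2)
                   for j in range(ndim)] for i in range(ndim)])
    return np.sqrt(np.diag(np.linalg.inv(F)))   # forecast 1-sigma errors
```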

For some of my early ML work on the correlation between photometric properties and gas-phase metallicity, you can check out this paper.

