
My research is at the interface between Astronomy and Data Science.
I am especially interested in using statistical techniques and machine learning methods to explore large galaxy catalogs and learn something about the evolution of the Universe.
These are some of the things I am up to right now:

A long-term project with Chris Lovell and Emille Ishida aims to predict the generalization error in estimating galaxy physical properties (stellar mass, SFHs...) with machine learning models trained on different simulations. Our hypothesis is that we can build a reasonably tight regression relation between the generalization error and a suitable distance metric in the (observed) space of simulated spectra. The final goal is to predict the generalization error on data, even without training labels. We test this idea by training models to predict stellar mass on 20 sets of simulations, and applying each of them to the other 19. We show one example plot where the "target" set of spectra is simulation 1: it shows the MSE that we obtain when we apply the 20 learned functions, versus our chosen distance metric, the Euclidean distance on the 100 most significant features. There is a clear trend that suggests the possibility of fitting the regression successfully and predicting the generalization error on data (where only the distance is known). From our recent NeurIPS paper.
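To make the setup concrete, here is a minimal toy sketch in Python. It is not the pipeline from the paper, and everything in it is an assumption made for illustration: the mock "simulations" are random numbers with a shifted mean, the regressor is ordinary least squares, and the distance is the Euclidean distance between mean mock spectra restricted to the `k` highest-variance features. The names `make_sim`, `fit_linear`, and `predict` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_gal, n_feat, k = 6, 400, 50, 10  # toy sizes, not the paper's 20 sets / 100 features


def make_sim(shift):
    """Toy 'simulation': mock spectra X and a mock stellar-mass label y."""
    X = rng.normal(loc=shift, scale=1.0, size=(n_gal, n_feat))
    y = np.tanh(X[:, :5]).sum(axis=1) + 0.1 * rng.normal(size=n_gal)
    return X, y


def fit_linear(X, y):
    """Ordinary least squares with an intercept (stand-in for any regressor)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef


# Each mock simulation has a different mean offset in feature space.
shifts = np.linspace(0.0, 1.5, n_sims)
sims = [make_sim(s) for s in shifts]
X_target, y_target = sims[0]  # the "target" set of spectra

# Assumed feature selection: the k features with the largest pooled variance.
pooled = np.vstack([X for X, _ in sims])
top = np.argsort(pooled.var(axis=0))[-k:]

dist, mse = [], []
for X_src, y_src in sims[1:]:
    coef = fit_linear(X_src, y_src)  # train on a source simulation
    mse.append(np.mean((predict(coef, X_target) - y_target) ** 2))  # test on the target
    dist.append(np.linalg.norm(X_src[:, top].mean(axis=0) - X_target[:, top].mean(axis=0)))

dist, mse = np.array(dist), np.array(mse)

# Regress generalization error on distance; for unlabeled data only `dist` would be known.
slope, intercept = np.polyfit(dist, mse, deg=1)
print("distance:", np.round(dist, 2))
print("MSE     :", np.round(mse, 3))
print("linear fit: MSE ~ %.3f * distance + %.3f" % (slope, intercept))
```

In this toy version the fitted line plays the role of the regression relation: given only the distance between a new, unlabeled set of spectra and a training simulation, `slope * distance + intercept` would be the predicted error. Whether a straight line is the right functional form for real simulations is not something this sketch can say.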

In another recently completed work, in collaboration with Andy Lawler, we use Bayesian model selection to understand which star formation history models are favored by observations. We begin by asking: Can we infer *from data* how many major episodes of star formation a galaxy underwent, using the ratio of evidences (Bayes factor) between nested models of various complexity? We answer: most often yes, but you need exquisite photometry. In the figure, we show how the Bayes factor (ratio of the evidence of the "correct" model to that of the simpler model) varies as the S/N changes, from current to planned surveys. The ideal case is the one in which all Bayes factors are > 1, and we see that the percentage of such cases increases significantly as the S/N increases. We also ask: is the Savage-Dickey Density Ratio, a procedure to compute Bayes factors in nested models, a valid substitute for the evidence-based calculation? We answer: yes, but only if you have good sampling around the tails. This is relevant because the SDDR can be derived from "regular", not computationally expensive MCMC chains, unlike the evidence!
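For readers who have not met the Savage-Dickey Density Ratio, here is a small self-contained Python sketch on a toy problem, not the star formation history models from the paper. The assumptions are all mine: Gaussian data with known scatter `sigma`, a complex model with a free mean `theta` and a Gaussian prior of width `tau`, a simpler nested model that fixes `theta = theta0`, and posterior draws generated analytically as a stand-in for an MCMC chain. In this convention the SDDR estimates the Bayes factor of the simpler model over the complex one as the posterior density at `theta0` divided by the prior density at `theta0`; its inverse gives the factor in the other direction.

```python
import numpy as np


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


rng = np.random.default_rng(1)

# Toy setup (all values assumed): n data points with known scatter sigma,
# summarized by their sample mean ybar.
n, sigma, tau = 50, 1.0, 1.0
ybar = 0.2
theta0 = 0.0  # value of theta that defines the simpler, nested model

# Posterior of theta under the complex model (conjugate Gaussian case).
post_var = 1.0 / (n / sigma**2 + 1.0 / tau**2)
post_mean = post_var * n * ybar / sigma**2

# Stand-in for an MCMC chain: direct draws from the known posterior.
chain = rng.normal(post_mean, np.sqrt(post_var), size=200_000)

# SDDR: estimate the posterior density at theta0 from the chain
# (fraction of samples in a narrow window), then divide by the prior density.
h = 0.02
post_at_theta0 = np.mean(np.abs(chain - theta0) < h) / (2.0 * h)
prior_at_theta0 = normal_pdf(theta0, 0.0, tau**2)
bf_sddr = post_at_theta0 / prior_at_theta0

# Evidence-based Bayes factor, available in closed form for this toy problem.
se2 = sigma**2 / n
bf_exact = normal_pdf(ybar, 0.0, se2) / normal_pdf(ybar, 0.0, se2 + tau**2)

print("Bayes factor (simple / complex) from SDDR    :", round(bf_sddr, 3))
print("Bayes factor (simple / complex) from evidence:", round(bf_exact, 3))
```

The window estimate only works if enough chain samples land near `theta0`. If the data pushed the posterior far away from `theta0`, very few samples would fall inside the window and the estimate would become noisy, which is the toy-model version of the sampling caveat mentioned above.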

In 2019, Chris Lovell and I (+ collaborators) published a paper in which we trained machine learning and deep learning (CNN) methods to understand the relationship between galaxy spectra and their star formation histories, using Illustris and EAGLE to train the models and the Flexible Stellar Population Synthesis models to model galaxies. We found great things:
- Matching the SDSS coverage and S/N for massive galaxies (log(M/MSun) > 10), we can constrain the SFH of a galaxy in 8 lookback-time bins with an average accuracy of 10-20% in each bin (see figure for a set of six cases, from best to worst case).
- The numbers don't blow up even if we train on one simulation and test on the other.
Additional stray ideas/paused projects include using a combination of supervised and unsupervised techniques to develop improved indicators of gas-phase metallicity in galaxies. It started with a re-calculation of emission line strength in the SDSS data set using FADO, and I hope to pick it up again soon!

Digging a bit further into the past, understanding the physical properties of galaxies, such as stellar mass, star formation histories, dust content, redshift, and metallicity, through Spectral Energy Distribution fitting has been a major focus of my research for several years. I wrote two Markov Chain Monte Carlo codes for SED fitting. I still use them, but now they have tons of competitors (and that's a very good thing!); see the GalMC page.
Don’t have data yet? If you are planning a survey and would like to know how well you can constrain the physical properties of galaxies with your observations (or want to try out a few different ones and see which one works best), you are welcome to use GalFish.
For some of my early ML work on the correlation between photometric properties and gas-phase metallicity you can check out this paper.
I am a member of the Hobby-Eberly Telescope Dark Energy eXperiment (HETDEX) collaboration, which aims to use Lyman Alpha Emitting galaxies to study the behavior of Dark Energy at early times, when we don’t expect to see much of it (but then we weren’t expecting dark energy at all, so I am hopeful for surprises).