Dr. Viviana Acquaviva

Welcome to the TerraD2I (Data to Insights) Lab!
We work at the interface of models and data
and use statistical and machine learning tools
to study different aspects of the Earth System.
Read about our current projects below
and consider joining us!


CARBONara

This project aims to reconstruct the global surface ocean pCO2 field, starting from observations that are extremely sparse in space and time. Because of this sparsity, the reconstruction of the full field relies on additional information that can be measured from satellites, such as the temperature and salinity of the ocean. These become the features of a machine learning model that is trained to predict pCO2 using the available observations as a training set. The predictions of the ML model are then used for "infilling", or reconstructing, the full pCO2 field, which in turn serves to estimate the global ocean carbon sink. This is a naturally difficult problem for ML methods, because there is an unavoidable distribution shift between the training domain (where observations are available) and the application domain (all other points in space and time). The project's objective is to improve this reconstruction, making it more accurate and robust.
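As a rough illustration of the infilling idea (not the lab's actual pipeline), here is a minimal sketch: a model is trained only on grid cells with pCO2 observations, then applied everywhere. The feature names and the choice of a random forest are assumptions for the example.

```python
# Minimal sketch of pCO2 "infilling": train on sparse observed points, then
# predict on every grid cell. Feature names and model choice are illustrative.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# df holds one row per (lat, lon, month) grid cell; driver variables are
# available everywhere, while pCO2 observations exist only at a sparse subset.
FEATURES = ["sst", "sss", "chl", "mld", "lat", "lon", "month"]  # hypothetical columns

def reconstruct_pco2(df: pd.DataFrame) -> np.ndarray:
    observed = df["pco2_obs"].notna()                 # sparse training mask
    model = RandomForestRegressor(n_estimators=300, random_state=0)
    model.fit(df.loc[observed, FEATURES], df.loc[observed, "pco2_obs"])
    # Predict on all grid cells, observed or not: the "infilled" field.
    return model.predict(df[FEATURES])
```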

Collaborators: Galen McKinley's group at Columbia University and LDEO, including Amanda Fay, Thea Heimdal, and Abby Shaum; Romina Wild (OGS, Trieste); Alessandro Laio (SISSA).
Metrics Reloaded

In this project, led by postdoctoral researcher Gabriele Accarino, we are developing new metrics to capture the complex multi-scale behavior of climate fields, such as temperature and precipitation, with the goal of improving model evaluation and boosting models' forecasting skill. Our new metric, WaveSim, goes beyond point-wise metrics by using a wavelet decomposition framework to attribute differences between maps to physical scales, and can point to magnitude, displacement, and/or structure differences as the underlying cause.
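To illustrate the general idea of attributing map differences to physical scales (this is not WaveSim itself, whose definition is not reproduced here), a simple scale-wise comparison of two fields can be built from a 2D wavelet decomposition, e.g. with PyWavelets:

```python
# Illustrative sketch only: compare two climate fields scale by scale using a
# 2D wavelet decomposition (PyWavelets). Not the actual WaveSim metric.
import numpy as np
import pywt

def scalewise_mse(field_a: np.ndarray, field_b: np.ndarray,
                  wavelet: str = "haar", level: int = 4) -> dict:
    """Mean squared difference of wavelet detail coefficients, per scale."""
    ca = pywt.wavedec2(field_a, wavelet, level=level)
    cb = pywt.wavedec2(field_b, wavelet, level=level)
    scores = {}
    for lev, (da, db) in enumerate(zip(ca[1:], cb[1:]), start=1):
        # da, db are (horizontal, vertical, diagonal) detail coefficient arrays
        scores[f"scale_{lev}"] = float(
            np.mean([np.mean((a - b) ** 2) for a, b in zip(da, db)])
        )
    return scores
```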

Collaborators: Gabriele Accarino (Columbia), Sara Shamekh (NYU), Duncan Watson-Parris (UCSD), and Dave Lawrence (NCAR).
The material below is from my pre-pivot work in Astrophysics.

Figure: Feature ranking of galaxies divided by their bulge-to-stellar mass ratio.
In early 2025, my possibly final Astro paper with Festa Buçinca and Ari Maller came out! We tested a pipeline to discover hypotheses for physical models. We looked at whether we can recover a model for the origin of galaxy sizes in a semi-analytic model of galaxy formation, using feature ranking techniques to figure out which variables are important, and symbolic regression to generate our hypotheses. Spoiler alert: it's really hard!
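A minimal sketch of the two-stage pipeline described above (ranking first, symbolic regression second), with illustrative column names and model choices that are not those of the paper:

```python
# Sketch of the hypothesis-discovery pipeline: rank features with permutation
# importance, then feed the top ones to a symbolic regression tool.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def rank_features(X: np.ndarray, y: np.ndarray, names: list) -> list:
    """Return (feature name, importance) pairs, most important first."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
    imp = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
    order = np.argsort(imp.importances_mean)[::-1]
    return [(names[i], imp.importances_mean[i]) for i in order]

# The highest-ranked variables would then go to a symbolic regression package
# (e.g., PySR) to generate candidate analytic expressions for galaxy size.
```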

As a spin-off from the project described below, I have been working with Chris Lovell, and more recently Kartheik Iyer and a few other folks at the KITP Workshop on Galaxy Evolution x ML, on learning effective physical representations for galaxy spectra. We explored two ideas: (1) use clustering in feature space to reduce correlation and dimensionality; (2) create a "superset" of features that contains fluxes and gradients on different scales. Of course, the two can also be combined. We found that this works quite well: we can determine galaxy physical parameters (e.g., stellar mass, star formation rate, median stellar age) with minimal information loss (a 0-20% increase in MSE when going from 2000 to 10-100 features). Examples of the clusters we found can be seen on the right (affectionately dubbed "Re-discovering Photometry").
Image credit: Chris Lovell
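A minimal sketch of idea (1), clustering correlated spectral bins and keeping one summary value per cluster; the numbers of features and clusters are illustrative assumptions, not the actual pipeline:

```python
# Compress a spectrum by clustering correlated wavelength bins ("features")
# and averaging within each cluster. Numbers here are illustrative.
import numpy as np
from sklearn.cluster import FeatureAgglomeration

def compress_spectra(spectra: np.ndarray, n_clusters: int = 50) -> np.ndarray:
    """Reduce ~2000 spectral bins to n_clusters averaged 'super-features'.

    spectra: array of shape (n_galaxies, n_bins); returns (n_galaxies, n_clusters).
    """
    agglo = FeatureAgglomeration(n_clusters=n_clusters)
    return agglo.fit_transform(spectra)

# One would then train a regressor for, say, stellar mass on the compressed
# representation and compare its MSE to a model trained on the full spectrum.
```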
A long-term (most likely, infinite-term) project with Chris Lovell and Emille Ishida aims to predict the generalization error in predicting galaxy physical properties (stellar mass, SFHs...) when using machine learning models trained on different simulations. Our hypothesis is that we can build a reasonably tight regression relation between the generalization error and a suitable distance metric in the (observed) space of simulated spectra. The final goal is to predict the generalization error on data, even without training labels. We test this idea by training models to predict stellar mass on 20 sets of simulations and applying them to each of the other 19. We show one example plot where the "target" set of spectra is simulation 1, plotting the MSE obtained when we apply the 20 learned functions against our chosen distance metric, the Euclidean distance on the 100 most significant features. There is a clear trend that suggests the possibility of fitting the regression successfully and predicting the generalization error on data (where only the distance is known). From our recent NeurIPS paper.
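The core idea can be sketched as a simple regression of error against distance; the distance definition below (Euclidean distance between mean feature vectors) is an assumption for the example and need not match the paper's exact choice:

```python
# Sketch: relate cross-simulation generalization error (MSE) to a distance in
# feature space, then use the fitted relation to forecast the error on data
# where no labels exist. Details are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

def euclidean_feature_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Distance between the mean feature vectors of two sets of spectra."""
    return float(np.linalg.norm(feats_a.mean(axis=0) - feats_b.mean(axis=0)))

def fit_error_vs_distance(distances: np.ndarray, mses: np.ndarray) -> LinearRegression:
    """distances, mses: one entry per (training sim, target sim) pair."""
    return LinearRegression().fit(distances.reshape(-1, 1), mses)

# To forecast performance on real data: compute its distance to the training
# simulation and read the expected MSE off the fitted relation.
```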

In another work, in collaboration with Andy Lawler, we used Bayesian model selection to understand which star formation history models are favored by observations. We begin by asking: can we infer *from data* how many major episodes of star formation a galaxy underwent, using the ratio of evidences (Bayes factor) between nested models of varying complexity? We answer: most often yes, but you need exquisite photometry. In the figure, we show how the Bayes factor (ratio of the evidence of the "correct" model versus a simpler model) varies as a result of changing the S/N, from current to planned surveys. The ideal case is the one in which all Bayes factors are > 1, and we see that the percentage of cases satisfying this increases significantly as the S/N increases. We also ask: is the Savage-Dickey Density Ratio, a procedure to compute Bayes factors in nested models, a valid substitute for the evidence-based calculation? We answer: yes, but only if you have good sampling in the tails. This is relevant because the SDDR can be derived from "regular" (not computationally expensive) MCMC chains, unlike the evidence!
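For readers unfamiliar with the SDDR, here is a toy sketch of the standard construction: for nested models where the extra parameter is fixed to some value in the simpler model, the Bayes factor in favor of the simpler model is the posterior density at that value divided by the prior density there. The KDE-based estimate is illustrative and, as noted above, only reliable if the chain samples that region well.

```python
# Toy Savage-Dickey density ratio for a parameter with a uniform prior.
import numpy as np
from scipy.stats import gaussian_kde, uniform

def sddr_bayes_factor(posterior_samples: np.ndarray, theta0: float,
                      prior_lo: float, prior_hi: float) -> float:
    """BF(simple / complex): posterior density at theta0 over prior density at theta0."""
    posterior_density = gaussian_kde(posterior_samples)(theta0)[0]
    prior_density = uniform.pdf(theta0, loc=prior_lo, scale=prior_hi - prior_lo)
    return posterior_density / prior_density
```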


In 2019, Chris Lovell, collaborators, and I published a paper in which we trained machine learning and deep learning (CNN) methods to understand the relationship between galaxy spectra and their star formation histories, using Illustris and EAGLE to train the models and Flexible Stellar Population Synthesis models to model the galaxies. We found great things:

  • Matching the SDSS coverage and S/N for massive galaxies (log(M/MSun) > 10), we can constrain the SFH of a galaxy in 8 lookback-time bins with an average accuracy of 10-20% in each bin (see figure for a set of six cases, from best to worst).
  • The numbers don't blow up even if we train on one simulation and test on the other.
However, we still felt that we needed to figure out what the generalization properties of these models would be on data, which prompted our follow-up project (above).
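For concreteness, here is a minimal sketch of the kind of network involved; the architecture, layer sizes, and PyTorch implementation are illustrative assumptions and not the one used in the paper:

```python
# Illustrative 1D CNN mapping a galaxy spectrum to the star formation history
# in 8 lookback-time bins.
import torch
import torch.nn as nn

class SFHNet(nn.Module):
    def __init__(self, n_pixels: int = 2000, n_bins: int = 8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * (n_pixels // 16), 64), nn.ReLU(),
            nn.Linear(64, n_bins),               # one output per lookback-time bin
        )

    def forward(self, spectrum: torch.Tensor) -> torch.Tensor:
        # spectrum: (batch, 1, n_pixels)
        return self.head(self.conv(spectrum))
```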


Figure: Pretty colors trace likelihood in this plot showing galaxy properties, from VA, Vargas, Gawiser and Guaita 2012.
Digging a bit further into the past: understanding the physical properties of galaxies, such as stellar mass, star formation histories, dust content, redshift, and metallicity, through Spectral Energy Distribution (SED) fitting has been a major focus of my research for several years. I wrote two Markov Chain Monte Carlo codes for SED fitting. I still use them, but they now have tons of competitors (and that's a very good thing!); see the GalMC page.
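In the spirit of GalMC (but not its actual code), a bare-bones MCMC SED fit looks like this; the emcee sampler is real, while model_fluxes, the parameters, and the filter set are toy stand-ins for a proper stellar population synthesis call:

```python
# Bare-bones MCMC SED-fitting sketch with emcee; the "SED model" is a toy.
import numpy as np
import emcee

BANDS = np.linspace(0.4, 2.2, 8)                # hypothetical filter wavelengths (microns)

def model_fluxes(theta: np.ndarray) -> np.ndarray:
    """Toy stand-in for a stellar population synthesis prediction."""
    mass, age, dust = theta                     # hypothetical parameters
    return mass * BANDS ** (-age) * np.exp(-dust * BANDS)

def log_posterior(theta, obs_flux, obs_err):
    if np.any(theta < 0):                       # crude flat prior with bounds
        return -np.inf
    resid = (obs_flux - model_fluxes(theta)) / obs_err
    return -0.5 * np.sum(resid ** 2)            # Gaussian log-likelihood

def run_fit(obs_flux, obs_err, ndim=3, nwalkers=32, nsteps=5000):
    p0 = np.abs(np.random.randn(nwalkers, ndim)) + 1e-3
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior,
                                    args=(obs_flux, obs_err))
    sampler.run_mcmc(p0, nsteps)
    return sampler.get_chain(discard=1000, flat=True)
```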

Don’t have data yet? If you are planning a survey and would like to know how well you can constrain the physical properties of galaxies with your observations (or want to try out a few different ones and see which one works best), you are welcome to use GalFish.
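The forecasting idea behind such a tool can be sketched with a standard Fisher-matrix calculation (this is not GalFish's actual code): given a fiducial model, photometric uncertainties, and numerical derivatives of the model fluxes, one estimates how well each parameter could be constrained.

```python
# Generic Fisher-matrix forecast for a Gaussian likelihood with independent
# photometric errors; model_fluxes(theta) -> predicted fluxes is supplied by the user.
import numpy as np

def fisher_forecast(model_fluxes, theta_fid: np.ndarray, sigma: np.ndarray,
                    step: float = 1e-4) -> np.ndarray:
    """Return forecast 1-sigma uncertainties on each parameter."""
    ndim = len(theta_fid)
    derivs = []
    for i in range(ndim):
        dtheta = np.zeros(ndim)
        dtheta[i] = step
        derivs.append((model_fluxes(theta_fid + dtheta) -
                       model_fluxes(theta_fid - dtheta)) / (2 * step))
    F = np.array([[np.sum(derivs[i] * derivs[j] / sigma ** 2)
                   for j in range(ndim)] for i in range(ndim)])
    return np.sqrt(np.diag(np.linalg.inv(F)))   # forecast 1-sigma errors
```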

For some of my early ML work on the correlation between photometric properties and gas-phase metallicity, you can check out this paper.

