Automated discovery of interpretable gravitational-wave population models
Kaze W. K. Wong (Flatiron Institute) and Miles Cranmer (Princeton University); equal contribution
TL;DR
There will be more and more gravitational wave (GW) events in the coming decade.
To study the GW population, people usually write down simple phenomenological models.
But that’s hard to do when your data is getting more complex, so people turn to flexible models like Gaussian mixture models. The problem is, these models are not interpretable.
Symbolic regression can help distill interpretable models from these flexible models.
We apply this to gravitational wave data, but this can be applied to other problems as well!
Main poster (~2-minute read)
Increasing number of gravitational wave events
In the first LIGO observational run in 2015, there were 3 gravitational wave (GW) detections in 3 months.
In the second observational run from November 2016 to August 2017, there were 8 detections in 9 months.
Then in the third observational run that started in 2019 and lasted for a year, there were 56 events detected.
The GW catalog is growing exponentially, which opens up tremendous opportunities to learn astrophysics from the GW population.
Phenomenological modelling of the GW population
To analyze the population of GWs, the community often fits phenomenological models to the data,
such as a power law, or a power law plus a Gaussian, to model the primary mass distribution.
The problem is that writing down a sensible phenomenological model depends on an individual's intuition,
and such models become increasingly difficult to construct as the data grow more complex.
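For concreteness, here is a minimal sketch of such a power-law-plus-Gaussian mass model in Python; the function name, parameter values, and normalization choices are illustrative assumptions, not the exact model used in the paper.

import numpy as np

def powerlaw_plus_gaussian(m, alpha=-2.3, m_min=5.0, m_max=80.0, lam=0.05, mu=35.0, sigma=5.0):
    # Truncated power law, normalized on [m_min, m_max] (illustrative values)
    norm = (m_max ** (alpha + 1) - m_min ** (alpha + 1)) / (alpha + 1)
    powerlaw = np.where((m >= m_min) & (m <= m_max), m ** alpha / norm, 0.0)
    # Gaussian "pile-up" component centered at mu solar masses
    gaussian = np.exp(-0.5 * ((m - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    # Mixture: mostly power law, plus a small Gaussian peak with weight lam
    return (1.0 - lam) * powerlaw + lam * gaussian

masses = np.linspace(5.0, 80.0, 200)
density = powerlaw_plus_gaussian(masses)

Each parameter here has a direct physical reading (slope, mass cut-offs, peak location), which is exactly the interpretability that flexible models tend to lose.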
Flexible models
To solve the problem with phenomenological models, the community has started looking into more flexible models,
such as Gaussian mixture models and neural networks.
They are great at fitting the data; however, their flexibility makes them hard to interpret.
In particular, they are hard to interpret due to:
Large number of parameters.
Parameters are in a non-intuitive basis.
When modelling the primary mass distribution with a power-law-plus-Gaussian model, the spectral index of the power law measures the formation efficiency of compact objects as a function of mass, while the Gaussian captures a pile-up from pair-instability supernovae.
While one Gaussian can be meaningful, it becomes increasingly hard to assign a physical meaning to each individual Gaussian component as we increase the number of Gaussians.
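As a rough sketch of what fitting such a flexible model looks like in practice (the lognormal mass samples and the choice of 5 components here are synthetic and purely illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for a catalog of primary masses (solar masses)
rng = np.random.default_rng(0)
masses = rng.lognormal(mean=3.0, sigma=0.4, size=5000).reshape(-1, 1)

# Fit a flexible Gaussian mixture model to the 1D mass distribution
gmm = GaussianMixture(n_components=5, random_state=0).fit(masses)

# The fitted density is easy to evaluate, but the 5 means, widths, and
# weights do not individually carry an obvious physical meaning
grid = np.linspace(5.0, 100.0, 500).reshape(-1, 1)
log_density = gmm.score_samples(grid)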
Symbolic regression primer
Symbolic regression (SR) is a machine learning technique that searches the space of analytic expressions to fit a dataset.
An equation can be treated as a tree of operators (e.g. addition, sine), constants and variables.
Then we can randomly mutate the tree to obtain different expressions and keep the ones that fit the data well.
This is often done with a genetic algorithm, which allows cross-breeding of expressions and improves the search efficiency.
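A minimal sketch of this search using PySR (the toy data, operator choices, and number of iterations are assumptions for illustration, not a recommended configuration):

import numpy as np
from pysr import PySRRegressor

# Toy dataset: y = 2.5 * cos(x0) + x1**2, plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.5 * np.cos(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

# Genetic-algorithm search over expression trees built from these operators
model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["cos", "exp"],
)
model.fit(X, y)

# A table of candidate expressions, trading off complexity against accuracy
print(model.equations_)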
Interpreting the flexible model's results
In this work, we use symbolic regression to distill more interpretable expressions from a Gaussian mixture model fitted to the GW primary mass distribution.
As we increase the complexity of the equation (i.e. the number of nodes in the tree), the fitting accuracy improves at the cost of interpretability.
We show three equations with increasing complexity in the animation.
The first equation fits the excess of events around 10 solar masses with a Gaussian and the rest of the mass distribution with an exponential function.
This describes the bulk of the distribution, but unfortunately cannot accommodate the excess of events around 30 solar masses.
Once we allow for higher complexity, SR adds another Gaussian around 30 solar masses, which is what the state-of-the-art phenomenological model suggests.
If we allow for even higher complexity, SR tapers the exponential function to fit the tail of the distribution.
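Schematically, the distillation step amounts to asking SR to reproduce the density learned by the flexible model. Here is a hedged sketch of that idea, using synthetic mass samples and illustrative settings rather than the actual catalog and configuration from the paper:

import numpy as np
from sklearn.mixture import GaussianMixture
from pysr import PySRRegressor

# Step 1: fit a flexible model (a GMM) to synthetic primary-mass samples
rng = np.random.default_rng(0)
masses = rng.lognormal(mean=3.0, sigma=0.4, size=5000).reshape(-1, 1)
gmm = GaussianMixture(n_components=5, random_state=0).fit(masses)

# Step 2: evaluate the flexible model's density on a mass grid
grid = np.linspace(5.0, 100.0, 400).reshape(-1, 1)
density = np.exp(gmm.score_samples(grid))

# Step 3: ask SR for compact analytic expressions that reproduce that density
sr = PySRRegressor(
    niterations=100,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["exp"],
    maxsize=25,  # cap on the number of nodes, i.e. equation complexity
)
sr.fit(grid, density)
print(sr.equations_)  # one candidate expression per complexity level

The output is a whole ladder of expressions, from very simple to very flexible, which is exactly the complexity-versus-accuracy trade-off shown in the animation.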
Take home messages
Flexible models are very popular in data analysis because of their ability to fit complex datasets.
However, they are often difficult to interpret because of their large number of parameters in a non-intuitive basis.
SR can be a tool to distill more interpretable models out of these flexible models.
Perhaps the coolest part of this work is that the result you get out of the pipeline is a collection of models, which means you can use them in typical data analysis tasks such as parameter estimation or model selection (and print them in your paper).
One can surely take the equations fitted by SR and try to draw scientific conclusions with them.
And in case the referee asks “What about this robustness check?”, just plug the equation into a traditional data analysis pipeline and everything follows (see the sketch after this section).
After all, what’s the difference between a carbon-based ConvNet who looks at data and proposes equations and a silicon-based one?
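To make the “plug it into a traditional pipeline” point concrete, here is a minimal sketch of using a distilled expression as an ordinary parametric population model in a heavily simplified hierarchical likelihood; the expression, parameter values, and fake event posteriors are illustrative assumptions, and selection effects and prior reweighting are ignored.

import numpy as np

# Hypothetical SR-distilled expression: an exponential bulk plus a Gaussian peak.
# theta = (exponential scale, peak location, peak width); values are illustrative.
def population_model(m, theta):
    scale, mu, sigma = theta
    return np.exp(-m / scale) / scale + 0.05 * np.exp(-0.5 * ((m - mu) / sigma) ** 2)

# Heavily simplified hierarchical log-likelihood: average the population model
# over each event's posterior mass samples (selection effects and prior
# reweighting are ignored here for brevity).
def log_likelihood(theta, event_posteriors):
    per_event = [np.mean(population_model(samples, theta)) for samples in event_posteriors]
    return float(np.sum(np.log(per_event)))

# Toy usage with fake posterior samples for three events
rng = np.random.default_rng(0)
events = [rng.normal(loc, 2.0, size=1000) for loc in (9.0, 31.0, 35.0)]
print(log_likelihood((15.0, 30.0, 5.0), events))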
A couple more things
Showyourwork
Our paper is prepared using the package showyourwork, a workflow management tool for “reproducible, extensible, transparent, and just downright awesome open source scientific articles” (in their own wording). You can find the source code of our LaTeX file, the data files, and the scripts that generate the figures in this GitHub repository. We welcome comments in the form of GitHub issues. If you think reproducibility and openness are important to scientific publishing, give showyourwork a try!
Manim animation
The animations in this poster are made with Manim, an animation engine for explanatory math videos in Python, originally developed by Grant Sanderson for his YouTube channel 3Blue1Brown and subsequently maintained by the Manim community. We use the community version for the videos in this poster. If you are interested in making explanatory videos in this style, check out the engine! The source code for creating the animations in this poster can be found in this repo.
The authors
Kaze is a Flatiron Research Fellow at the Flatiron Institute, who works on almost everything related to gravitational waves (except instrumentation).
He also applies machine learning methods such as normalizing flows to modern data analysis problems in astrophysics.
He loves making animations and he tries to make one animation for each paper he publishes.
Twitter · GitHub · YouTube
Miles is a PhD candidate at Princeton University, working at the intersection of astrophysics and machine learning.
He is the main contributor and maintainer of PySR, a popular symbolic regression Python package.
Twitter · GitHub