| Issue | A&A, Volume 703, November 2025 |
|---|---|
| Article Number | A301 |
| Number of page(s) | 9 |
| Section | Cosmology (including clusters of galaxies) |
| DOI | https://doi.org/10.1051/0004-6361/202554602 |
| Published online | 25 November 2025 |
The BIG SOBOL SEQUENCE: How many simulations do we need for simulation-based inference in cosmology?
1. CNRS & Sorbonne Université, Institut d’Astrophysique de Paris (IAP), UMR 7095, 98 bis bd Arago, F-75014 Paris, France
2. Department of Physics and Astronomy, Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218, USA
3. Department of Applied Mathematics and Statistics, Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218, USA
4. Center for Computational Astrophysics, Flatiron Institute, 162 5th Avenue, New York, NY 10010, USA
5. Department of Astrophysical Sciences, Princeton University, Peyton Hall, Princeton, NJ 08544-0010, USA
⋆ Corresponding author: bairagi@iap.fr
Received: 18 March 2025
Accepted: 24 August 2025
How many simulations do we need to train machine learning methods to extract the information available from summary statistics of the cosmological density field? Neural methods have shown the potential to extract the nonlinear information available in cosmological data. Achieving this requires appropriate network architectures and a sufficient number of simulations to train the networks. This is the first detailed convergence study in which a neural network (NN) is trained to extract maximally informative summary statistics for cosmological inference. We show that currently available simulation suites, such as the Quijote Latin Hypercube with 2000 simulations, do not provide sufficient training data for a generic NN to reach the optimal regime. We present a case study in which we train a moment network to infer cosmological parameters from the nonlinear dark matter power spectrum, where the optimal information content can be computed through asymptotic analysis using the Cramér-Rao information bound. We find an empirical neural scaling law that predicts how much information a NN can extract from highly informative summary statistics as a function of the number of simulations used to train the network, for a wide range of architectures and hyperparameters. Looking beyond two-point statistics, we find a similar scaling law for the training of neural posterior inference using wavelet scattering transform coefficients. To verify our method, we created the largest publicly released suite of cosmological simulations, the BIG SOBOL SEQUENCE (BSQ), consisting of 32 768 N-body simulations uniformly covering the Λ cold dark matter (ΛCDM) parameter space. Our method enables efficient planning of simulation campaigns for machine learning applications in cosmology, while the BSQ dataset provides an unprecedented resource for studying the convergence behavior of NNs in cosmological parameter inference. Our results suggest that new, larger simulation suites or new training approaches will be necessary to infer information-optimal parameters from nonlinear simulations.
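To make the moment-network setup concrete, the following is a minimal sketch of how such a network could be trained: a regressor fitted with a mean-squared-error loss converges, given enough training data, toward the posterior mean of the parameters conditioned on the input summary statistic. The layer sizes and the names `n_k`, `n_params`, `mean_net`, and `train_step` are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: n_k power-spectrum bins in, n_params
# cosmological parameters out (both are assumptions, not BSQ values).
n_k, n_params = 128, 5

# One plausible fully connected regressor; the paper scans a wide range
# of architectures and hyperparameters, and this is a single instance.
mean_net = nn.Sequential(
    nn.Linear(n_k, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, n_params),
)

opt = torch.optim.Adam(mean_net.parameters(), lr=1e-3)
mse = nn.MSELoss()

def train_step(x, theta):
    """One gradient step on a batch of (summary, parameter) pairs.

    Minimizing the MSE between predicted and true parameters drives the
    network output toward the posterior mean E[theta | x], the first
    moment that a moment network estimates.
    """
    opt.zero_grad()
    loss = mse(mean_net(x), theta)
    loss.backward()
    opt.step()
    return loss.item()
```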
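The BSQ design itself follows from its name: 2^15 = 32 768 points of a Sobol sequence mapped onto the ΛCDM parameter box. A minimal sketch of generating such a design with `scipy.stats.qmc` is given below; the parameter names and prior ranges are illustrative, Quijote-like assumptions rather than the exact BSQ prior.

```python
from scipy.stats import qmc

# Illustrative 5D LambdaCDM prior box (Omega_m, Omega_b, h, n_s, sigma_8);
# these ranges are assumptions, not the published BSQ prior.
l_bounds = [0.1, 0.03, 0.5, 0.8, 0.6]
u_bounds = [0.5, 0.07, 0.9, 1.2, 1.0]

# Unscrambled Sobol sequence; random_base2(m=15) returns 2^15 = 32 768
# low-discrepancy points, matching the size of the BSQ suite.
sampler = qmc.Sobol(d=5, scramble=False)
unit_points = sampler.random_base2(m=15)             # points in [0, 1)^5
params = qmc.scale(unit_points, l_bounds, u_bounds)  # map to the prior box
print(params.shape)  # (32768, 5)
```

Unlike a Latin hypercube, a Sobol sequence can be extended to larger sample sizes while remaining uniformly space-filling, which is what makes a convergence study over the number of training simulations practical.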
Key words: methods: statistical / cosmological parameters / dark matter / large-scale structure of Universe
© The Authors 2025
Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This article is published in open access under the Subscribe to Open model.