A&A, Volume 704, December 2025
Article Number: L19
Number of pages: 7
Section: Letters to the Editor
DOI: https://doi.org/10.1051/0004-6361/202557774
Published online: 18 December 2025
Open Access

© The Authors 2025


1. Introduction

The observed population of exoplanets provides a direct probe into the complex processes governing the formation and evolution of planetary systems, although it is inevitably affected by the biases inherent in the different detection techniques. For instance, compared to the transit or direct imaging methods, the radial velocity (RV) method is most sensitive to massive planets with intermediate orbital periods (e.g. Wright & Gaudi 2013).

An important goal of many RV surveys is to characterise planet occurrence rates across different stellar populations (e.g. Cumming et al. 1999; Howard et al. 2010; Bonfils et al. 2013b; Rosenthal et al. 2021, and references therein). Achieving this requires careful sample selection, accurate stellar characterisation, detailed analysis of the individual RV time series, and an assessment of survey completeness to correct for observational biases.

In this Letter, we present a simple method to combine results from independent RV analyses into a probabilistic estimate of the occurrence rate of planets within a given region of parameter space. The approach bypasses the explicit definition of detection thresholds and the calculation of detection limits or survey completeness, relying only on posterior samples from existing analyses. This allows for efficiently estimating occurrence rates in different regions and stellar samples.

Section 2 reviews the Bayesian approach to RV analysis and outlines the proposed method. In Section 3, we present an application of the method to synthetic RV datasets, and in Section 4 we discuss the results.

2. Methods

2.1. Bayesian RV analysis

The use of Bayesian methods in the analysis of RVs is now an almost standard practice. The seminal work of Ford (2005) and Gregory (2005) provided the general framework for the use of Markov chain Monte Carlo techniques to sample the posterior distribution of the model parameters. More recently, nested sampling (Skilling 2004, 2006) and several variants (e.g. Higson et al. 2019; Feroz et al. 2009; Buchner 2023) have been widely used for more efficient sampling and to allow for the comparison of competing models.

The Bayesian approach boils down to a transformation of a prior distribution, π(θ | ℳ), for the parameters θ of a model ℳ into a posterior distribution, p(θ | 𝒟, ℳ), after having observed data 𝒟. Bayes’ theorem relates these distributions as

$$ p \left( \mathcal{D} \,\vert\, \mathcal{M} \right) \cdot p \left( \theta \,\vert\, \mathcal{D}, \mathcal{M} \right) = p \left( \mathcal{D} \,\vert\, \theta, \mathcal{M} \right) \cdot \pi \left( \theta \,\vert\, \mathcal{M} \right), \tag{1} $$

where we introduce the likelihood p(𝒟 | θ,ℳ) and the evidence p(𝒟 | ℳ), often denoted simply by 𝒵. Writing Bayes’ theorem in this form emphasises the inputs (prior and likelihood) and the outputs (evidence and posterior) of any Bayesian analysis (see Skilling 2004, for a discussion).

In the RV analysis context, the number of orbiting planets in a given system, Np, is a quantity of particular interest. Often, models with successively fixed values of Np are compared through their evidences to estimate the number of planets significantly detected in the dataset (e.g. Feroz et al. 2011). This requires defining a threshold for what is considered a significant detection, or a significant difference in evidence. Although much literature has been dedicated to this topic, the interpretation scale introduced by Jeffreys (1939) and refined by Kass & Raftery (1995) is very often used in the exoplanet community.

Alternatively, Np itself can be included in the set of model parameters and its posterior distribution estimated directly. Since Np is a discrete parameter and it controls the dimensionality of the model, certain algorithms (often called trans-dimensional) are more adequate for this task (e.g. Green 1995; Brewer 2014). Such methods were first applied to RV analysis by Brewer & Donovan (2015) and have since been used to characterise several exoplanet systems (e.g. Faria et al. 2016; Lillo-Box et al. 2020; Faria et al. 2022; Baycroft et al. 2023; John et al. 2023). The posterior for Np allows for efficient model comparison, in particular when several planets are detected.

2.2. Compatibility limits

A typical output from the RV analysis of a particular system, whether it led to planet detections or not, is an upper limit on the mass of planets that can be ruled out given the observations. These detection limits are almost always estimated by injecting mock signals into a residual dataset and assessing their recovery, usually with a periodogram analysis (e.g. Cumming et al. 1999; Mayor et al. 2011; Bonfils et al. 2013a).

This injection-recovery procedure lacks a clear statistical justification and can be computationally costly. Moreover, it requires defining a detection threshold and subtracting a single solution from the original dataset, thereby ignoring the uncertainties and correlations present in the posterior.

But, in fact, the posterior distribution may already contain all the available information about which orbital parameters are compatible with the data. For example, if no signals are detected, all posterior samples from models with Np ≥ 1 are compatible with the data and can be used to estimate so-called compatibility limits (Standing et al. 2022; Figueira et al. 2025, see also Tuomi et al. 2014), akin to the detection limits. If instead one signal is detected, posterior samples with Np ≥ 2 can be used.

This approach has several advantages. First, compatibility limits are available from a single run of the sampling algorithm, provided the prior for Np allows for sufficiently high values. Second, orbital parameters such as eccentricity can be effectively marginalised over, subject to the prior distribution. Finally, the compatibility limits can be calculated without subtracting any single solution from the original data. While useful in providing an estimated sensitivity of an RV dataset to planets of different periods and masses, which could be combined into a survey sensitivity map, the compatibility limits are not required for determining the occurrence rate.
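As an illustration, compatibility limits of this kind could be derived from existing posterior samples with only a few lines of code. The sketch below is not the implementation of the works cited above; it simply bins hypothetical arrays of posterior orbital periods and minimum masses and records an upper mass quantile per period bin.

```python
import numpy as np

def compatibility_limit(period, msini, n_bins=50, quantile=0.99):
    """Upper minimum-mass level compatible with the data, per period bin.

    period, msini : posterior samples (e.g. all samples with Np >= 1 when no
                    signal is detected), in days and Earth masses.
    Returns logarithmic bin centres and the requested mass quantile in each
    bin (NaN where a bin contains no samples).
    """
    edges = np.logspace(np.log10(period.min()), np.log10(period.max()), n_bins + 1)
    centres = np.sqrt(edges[:-1] * edges[1:])
    limits = np.full(n_bins, np.nan)
    for i in range(n_bins):
        in_bin = (period >= edges[i]) & (period < edges[i + 1])
        if in_bin.any():
            limits[i] = np.quantile(msini[in_bin], quantile)
    return centres, limits
```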

2.3. Planet occurrence rates

The analysis of RV datasets of multiple stars provides information about the occurrence rate of planets. We define the occurrence rate as the fraction of stars with at least one planet in a region R of the parameter space¹. The region can be defined in different ways, taking different planet properties into account; here, we only consider lower and upper limits in orbital period and in planetary minimum mass. The occurrence rate generally depends on the choice of the region.

In probabilistic terms, we define the event ‘there is at least one planet with parameters in a given region’. This event, here denoted as 𝒫, takes values 1 or 0 (true or false),

$$ \mathcal{P} \equiv \begin{cases} 1 & \text{if } \theta \in R \\ 0 & \text{otherwise}, \end{cases} \tag{2} $$

and for each individual star follows a Bernoulli distribution with probability equal to the occurrence rate fR, so that

$$ p \left( f_R; \mathcal{P} \right) = f_R \cdot \mathcal{P} + \left( 1 - f_R \right) \cdot \left( 1 - \mathcal{P} \right). \tag{3} $$

For a total of S stars (indexed by j), each with an RV dataset 𝒟j, we have Mj samples from the posterior p(θj | 𝒟j), obtained from the likelihood p(𝒟j | θj) and prior π(θj), where we omitted the dependence on the model ℳ. If the observations of each star are independent, the total likelihood for all the parameters of all stars is equal to the product of the individual-star likelihoods:

$$ p \left( \{ \mathcal{D}_j \}_S \,\big\vert\, \{ \theta_j \}_S \right) = \prod_{j=1}^{S} p \left( \mathcal{D}_j \,\vert\, \theta_j \right), \tag{4} $$

where the notation {.}a represents the set of a elements. We comment on the independence assumption in Section 4.

We are interested in the posterior distribution (and thus in the likelihood) in terms of the occurrence rate fR, which implies marginalising out the individual-star parameters:

$$ p \left( \{ \mathcal{D}_j \}_S \,\big\vert\, f_R \right) = \prod_{j=1}^{S} \int \mathrm{d}\theta_j \; p \left( \mathcal{D}_j \,\vert\, \theta_j \right) \cdot p \left( \theta_j \,\vert\, f_R \right). \tag{5} $$

An issue arises because we do not have access to p(θj | fR) from the individual analyses, since the occurrence rate was not taken into account when computing the posteriors². In order to change variables without having to redo the individual analyses (for each different region), we employ importance sampling to re-weight the original prior (e.g. Hogg et al. 2010):

$$ p \left( \theta_j \,\vert\, f_R \right) \equiv \frac{ p(f_R; \mathcal{P}_j) \cdot \pi(\theta_j) }{ \pi(\mathcal{P}_j) }. \tag{6} $$

Here, π(𝒫j) is the distribution of 𝒫j under the original prior. Substituting Eq. (6) into Eq. (5) results in integrals over the posterior distribution p(𝒟j | θj)⋅π(θj) multiplied by p(fR; 𝒫j)/π(𝒫j). These integrals can be approximated using averages of the posterior samples from each star so that the likelihood becomes

$$ p \left( \{ \mathcal{D}_j \}_S \,\big\vert\, f_R \right) \approx \prod_{j=1}^{S} \frac{1}{M_j} \sum_{k=1}^{M_j} \frac{ p(f_R; \mathcal{P}_{jk}) }{ \pi(\mathcal{P}_{jk}) }. \tag{7} $$

For the original prior, the equivalent to Eq. (3) is

$$ \pi(f_0; \mathcal{P}_j) = f_0 \cdot \mathcal{P}_j + (1 - f_0) \cdot (1 - \mathcal{P}_j), \tag{8} $$

with f0 ≠ fR. The value of f0 can be estimated (for each star) with a simple procedure, outlined in Appendix A. Then, we can write each j-th average in Eq. (7) as

$$ \frac{f_R}{f_0} \cdot p(\mathcal{P}_j \,\vert\, \mathcal{D}_j) + \frac{1 - f_R}{1 - f_0} \cdot \left( 1 - p(\mathcal{P}_j \,\vert\, \mathcal{D}_j) \right), \tag{9} $$

where p(𝒫j | 𝒟j) is the posterior probability of 𝒫j, obtained simply as the fraction of the posterior samples that fall within the region R. Combining the expression for the total likelihood with a prior for fR, we obtain the posterior distribution for the occurrence rate given all the individual datasets:

$$ p \left( f_R \,\big\vert\, \{ \mathcal{D}_j \}_S, \{ \theta_j \}_S \right) \propto p(f_R) \cdot p \left( \{ \mathcal{D}_j \}_S \,\big\vert\, f_R \right). \tag{10} $$

The prior p(fR) can be an uninformative uniform distribution between 0 and 1. The posterior can then be evaluated on a grid of values for fR. Intuitively, these expressions tell us that when the individual posterior probability within a given region is high compared to what is expected from the prior, the occurrence rate within that region is likely to be high.
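The calculation therefore reduces to a product over stars of the per-star averages in Eq. (9), evaluated on a grid of fR values. Below is a minimal sketch in Python, assuming the per-star values of p(𝒫j | 𝒟j) and f0 have already been computed; the function and argument names are ours, not those of the released implementation.

```python
import numpy as np

def occurrence_rate_posterior(p_region, f0, grid=None):
    """Posterior for the occurrence rate fR on a grid (Eqs. 7, 9, and 10).

    p_region : per-star posterior probability p(P_j | D_j), i.e. the fraction
               of posterior samples of star j that fall inside the region R.
    f0       : per-star prior probability of the event P_j (see Appendix A).
    Assumes a uniform prior on fR between 0 and 1.
    """
    p_region = np.atleast_1d(p_region)
    f0 = np.atleast_1d(f0)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 1001)
    fr = grid[:, None]
    # per-star average from Eq. (9), for every grid value of fR
    terms = fr / f0 * p_region + (1 - fr) / (1 - f0) * (1 - p_region)
    with np.errstate(divide='ignore'):
        log_like = np.log(terms).sum(axis=1)    # log of Eq. (7)
    post = np.exp(log_like - log_like.max())    # uniform prior: posterior ∝ likelihood
    post /= np.trapz(post, grid)                # normalise to a density
    return grid, post

# example: three stars with p(P|D) = 0.9, 0.02, and 0.5, all with f0 = 0.05
fr_grid, posterior = occurrence_rate_posterior([0.9, 0.02, 0.5], [0.05, 0.05, 0.05])
```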

2.4. Implementation in kima

In this work, we used the kima package (Faria et al. 2018)³ to analyse the RV datasets. By default, the code models the RVs with a sum of Keplerian functions, sampling from the joint posterior distribution with the Diffusive Nested Sampling algorithm (DNS; Brewer et al. 2011), which also provides the evidence 𝒵. In addition, DNS supports sampling in the trans-dimensional models mentioned above (Brewer & Donovan 2015), in which the number of planets is a free parameter. After an appropriate prior is defined, the posterior distribution for Np is estimated together with the other model parameters.

The resulting samples from the posterior distributions for several stars can be directly used to compute the posterior for the occurrence rate, as described in the previous section. Although the calculation is relatively simple, we provide a Python implementation of the method for convenience⁴.

3. Simulations

To demonstrate the proposed approach, we generated a total of 50 synthetic RV datasets (from 50 hypothetical stars), each containing between 40 and 50 measurements, drawn uniformly over a timespan of one year. We define three regions of parameter space on which we want to estimate the occurrence rate:

  • R1: P ∈ (2, 25) d; m ∈ (3, 30) M⊕, with assumed fR = 0.3,

  • R2: P ∈ (60, 100) d; m ∈ (50, 200) M⊕, with assumed fR = 0.7,

  • R3: P ∈ (100, 400) d; m ∈ (1, 10) M⊕, with assumed fR = 0.2.

For each star and each of these three regions, we generated one mock planet, with probability equal to the corresponding fR, and drew its orbital parameters from the prior π(θ), constrained to the region. That is, each RV time series has between zero and three simulated Keplerian signals, with parameters within the three regions.

All mock stars were assumed to have a mass of 1 M⊙ and zero systemic velocity. We drew RV uncertainties for each measurement from a uniform distribution in variance, between 2² and 5² m² s⁻², and added white Gaussian noise to the RVs following these heteroscedastic variances. No additional noise was added, but a jitter parameter was still used in the analysis. Two examples of the generated datasets are shown in the top panels of Fig. 1.
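For reference, the heteroscedastic noise described above could be generated as in the following short sketch; the seed and variable names are ours, and the Keplerian part is left as a placeholder.

```python
import numpy as np

rng = np.random.default_rng(42)                  # hypothetical seed
n_obs = rng.integers(40, 51)                     # between 40 and 50 measurements
t = np.sort(rng.uniform(0.0, 365.25, n_obs))     # epochs over one year [days]
var = rng.uniform(2.0**2, 5.0**2, n_obs)         # variances, uniform in [4, 25] m^2 s^-2
sigma = np.sqrt(var)                             # quoted per-point uncertainties [m/s]
keplerian = np.zeros(n_obs)                      # placeholder for the injected signals
rv = keplerian + rng.normal(0.0, sigma)          # heteroscedastic white Gaussian noise
```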

Fig. 1. Example of two simulated datasets, containing no planet signal (left) and two planet signals (right). Top: RV time series and random samples from the posterior predictive distribution. Bottom: Posterior distributions for Np.

We performed the analysis of each dataset with kima, using default settings for the DNS algorithm. The priors used in the analysis are listed in Table B.1 of Appendix B. We note that the prior for the orbital period extends to two years, even though the time spans of the datasets are around one year. The DNS algorithm ran for 100 000 iterations, providing more than 10 000 effective posterior samples for each dataset.

The bottom panels of Fig. 1 show the resulting posteriors for the number of planets in the same two example datasets. On the left, Np = 0 has the highest posterior probability, pointing to no planet being detected, while in the second dataset, Np = 1 is the most probable value. Figure 2 shows the joint posterior for the planetary minimum mass and the orbital period, combining all values of Np, together with the three regions. The dataset shown on the right contained two simulated planets, within R2 and R3. Only one of these signals is detected, appearing as an over-density in the posterior inside R2.

Fig. 2. Joint posteriors for the minimum mass and orbital period, shown as hexbin density plots, for the same example datasets as in Fig. 1. All samples with Np > 0 are shown. The three regions where the occurrence rates are estimated are shown as coloured boxes and labels.

In Fig. 3, we show the occurrence rate estimates for the three regions, as obtained by combining the analysis of five, 25, and all 50 datasets. The plots show the assumed uniform prior and the resulting posterior distribution for each fR, as well as the true simulated values. It is clear that the analysis of only five datasets provides little information about the occurrence rates, even though it already assigns low posterior probability to fR = 0 and fR = 1 in R2, for example.

Fig. 3. Occurrence rate estimates for the three regions R1, R2, and R3, from the analysis of five, 25, and all 50 simulated datasets. The panels show the prior and posterior distributions (dotted and solid lines, respectively), and the true simulated values (vertical dashed lines). Each panel also shows the mean and standard deviation of the posterior distribution. The prior has a mean and a standard deviation of 0.5 and 0.28.

Including more stars improves the precision of the posterior estimates, but differently for each region. This is discussed in more detail in Appendix C. Since none of the simulated datasets is sensitive to planets in R3, with long orbital periods and small masses, the estimate for the occurrence rate is uncertain, and the posterior is very close to the uniform prior.

4. Discussion

We presented a method for estimating the occurrence rate of planets in a given region of parameter space, from the analysis of RV data from several stars. Using samples from the posterior distributions of the individual analyses, we estimated the probability that at least one planet is present within a region limited in orbital period and planetary mass. This only required computing the fraction of the posterior samples that fall within the region and comparing it to the expected fraction from the prior.

Based on 50 simulated RV datasets, we inferred the occurrence rates for three regions, defined in such a way as to probe different RV detection sensitivities. In region R2, all simulated signals should be easy to detect, even with only ∼50 RV measurements (cf. the example in Fig. 1). On the other hand, none of the simulated time series are sensitive to planet signals in R3. Region R1 was chosen to represent an intermediate case, where some signals may be detected.

We recovered unbiased estimates of the occurrence rates for the three regions. As more stars were included, the posterior uncertainties decreased, and the estimates became more precise (see Fig. 3). For regions where the RV data were insensitive to planet signals, the posterior remained close to the prior, as expected.

This approach relies solely on existing posterior samples, does not require injection-recovery tests, and bypasses the need to define an explicit detection threshold. Moreover, the same set of posterior samples can be reused to estimate occurrence rates for different regions or stellar subsamples, offering a flexible and computationally efficient framework for population inference.

4.1. Caveats and assumptions

Listing the main assumptions of the proposed method may shed light on its limitations. Violations of these assumptions may lead to underestimated uncertainty or biased estimates of fR.

To write Eqs. (4) and (5), we assumed that the RV datasets for each star are independent (conditional on knowing fR). This may not be true in real RV surveys, because of either shared instrumental systematics or the presence of correlations within the stellar population under study. For example, the instrument used to measure the RVs may imprint a residual calibration error in all datasets, or the true planet occurrence may depend (strongly) on stellar parameters such as mass, metallicity, or age.

Our analyses assumed a simple white noise model, and the simulated datasets also only included white noise. Real RV data almost certainly contain other sources of noise, such as instrumental systematics or stellar activity signals. Fortunately, the method allows for more complex noise models to be used in the individual analyses, which may improve the sensitivity of the RV datasets to planets with lower mass, and mitigate the appearance of spurious signals.

We also assumed that the available samples are representative of the true posterior distributions, even in regions of low probability. In practice, care must be taken to ensure that the sampling algorithm explores the parameter space correctly (see Appendix D for some diagnostics).

4.2. Future work

Our method provides a simple way to make inferences about the exoplanet population. The natural next step is to apply it to real RV datasets from ongoing or historical surveys. We also plan to investigate how instrumental systematics and stellar activity influence the inferred occurrence rates, and to mitigate these effects by introducing more flexible noise models in the individual analyses. In addition, our approach can be extended to study planet multiplicity and other orbital characteristics, either by adapting the definition of the regions of interest or changing the statistical model (see Appendix D for details).

More generally, even though we focused on RV analysis in this work, similar approaches could be applied to transit and direct imaging surveys (see e.g. Foreman-Mackey et al. 2014; Bowler et al. 2020), enabling a unified view of the demographics of exoplanets across multiple detection techniques.


1. But see Appendix D for an alternative definition.

2. In particular, the individual analyses need not consider any specific region R where the occurrence rate is to be estimated.

3. Available at github.com/kima-org/kima.

Acknowledgments

We thank the anonymous referee for constructive feedback that improved this manuscript. This research was funded by the Swiss National Science Foundation (SNSF), grant numbers 200020, 205010.

References

  1. Baycroft, T. A., Triaud, A. H. M. J., Faria, J., Correia, A. C. M., & Standing, M. R. 2023, MNRAS, 521, 1871
  2. Bonfils, X., Delfosse, X., Udry, S., et al. 2013a, A&A, 549, A109
  3. Bonfils, X., Lo Curto, G., Correia, A. C. M., et al. 2013b, A&A, 556, A110
  4. Bowler, B. P., Blunt, S. C., & Nielsen, E. L. 2020, AJ, 159, 63
  5. Brewer, B. J. 2014, ArXiv e-prints [arXiv:1411.3921]
  6. Brewer, B. J., & Donovan, C. P. 2015, MNRAS, 448, 3206
  7. Brewer, B. J., Pártay, L. B., & Csányi, G. 2011, Stat. Comput., 21, 649
  8. Buchner, J. 2023, Stat. Surv., 17, 169
  9. Cumming, A., Marcy, G. W., & Butler, R. P. 1999, ApJ, 526, 890
  10. Elvira, V., Martino, L., & Robert, C. P. 2022, Int. Stat. Rev., 90, 525
  11. Faria, J. P., Haywood, R. D., Brewer, B. J., et al. 2016, A&A, 588, A31
  12. Faria, J. P., Santos, N. C., Figueira, P., & Brewer, B. J. 2018, JOSS, 3, 487
  13. Faria, J. P., Suárez Mascareño, A., Figueira, P., et al. 2022, A&A, 658, A115
  14. Feroz, F., Hobson, M. P., & Bridges, M. 2009, MNRAS, 398, 1601
  15. Feroz, F., Balan, S. T., & Hobson, M. P. 2011, MNRAS, 415, 3462
  16. Figueira, P., Faria, J. P., Silva, A. M., et al. 2025, A&A, 700, A174
  17. Ford, E. B. 2005, AJ, 129, 1706
  18. Foreman-Mackey, D., Hogg, D. W., & Morton, T. D. 2014, ApJ, 795, 64
  19. Green, P. J. 1995, Biometrika, 82, 711
  20. Gregory, P. C. 2005, ApJ, 631, 1198
  21. Higson, E., Handley, W., Hobson, M., & Lasenby, A. 2019, Stat. Comput., 29, 891
  22. Hogg, D. W., Myers, A. D., & Bovy, J. 2010, ApJ, 725, 2166
  23. Howard, A. W., Marcy, G. W., Johnson, J. A., et al. 2010, Science, 330, 653
  24. Jeffreys, H. 1939, Theory of Probability (Oxford: Oxford University Press)
  25. John, A. A., Collier Cameron, A., Faria, J. P., et al. 2023, MNRAS, 525, 1687
  26. Kass, R. E., & Raftery, A. E. 1995, J. Am. Stat. Assoc., 90, 773
  27. Lillo-Box, J., Figueira, P., Leleu, A., et al. 2020, A&A, 642, A121
  28. Mayor, M., Marmier, M., Lovis, C., et al. 2011, ArXiv e-prints [arXiv:1109.2497]
  29. Rosenthal, L. J., Fulton, B. J., Hirsch, L. A., et al. 2021, ApJS, 255, 8
  30. Skilling, J. 2004, AIP Conf. Proc., 735, 395
  31. Skilling, J. 2006, Bayesian Anal., 1, 833
  32. Standing, M. R., Triaud, A. H. M. J., Faria, J. P., et al. 2022, MNRAS, 511, 3571
  33. Tuomi, M., Jones, H. R., Barnes, J. R., Anglada-Escudé, G., & Jenkins, J. S. 2014, MNRAS, 441, 1545
  34. Wright, J. T., & Gaudi, B. S. 2013, in Planets, Stars and Stellar Systems, eds. T. D. Oswalt, L. M. French, & P. Kalas (Dordrecht: Springer), 489
  35. Zhu, W. 2019, ApJ, 873, 8

Appendix A: Estimating π(𝒫)

The assumed priors for the model parameters define the prior distribution for the event 𝒫 for each star: a Bernoulli distribution with probability f0 (see Eq. 8). In the uninteresting limit where the priors exclude the region of interest, this probability is zero and the proposed method to estimate the occurrence rate is ill-defined. In other cases, the prior probability can be computed from the separable priors for the orbital period πP, the semi-amplitude πK, the eccentricity πe, and the number of planets πNp.

In practice, we estimate f0 using Monte Carlo: we draw a large number of random samples from πP, πK, and πe, calculate the planet mass for each sample using the stellar mass, and count the number of samples that fall within the region of interest. If the cumulative distribution function of πP is known, the random sampling can be made more efficient (and lower variance) by only sampling periods within the region. Denoting the fraction of samples falling within the region as F, we then calculate

$$ f_0 = 1 - \sum_{N_p=0}^{N_p^{\mathrm{max}}} \pi_{N_p}(N_p) \cdot (1 - F)^{N_p}, \tag{A.1} $$

where Npmax is the maximum value of Np (set to 5 in our analyses, see Table B.1). This expression accounts for the “at least one” constraint on the definition of 𝒫. The calculation of f0 needs to be repeated for each star with the appropriate value for the stellar mass. With modern hardware, obtaining the random samples and calculating Eq. (A.1) can be achieved in about one second.
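A sketch of this Monte Carlo estimate is given below. Since Table B.1 is not reproduced here, the prior choices in the code (log-uniform period and semi-amplitude, a beta eccentricity, and a uniform prior on Np up to 5) are placeholders rather than the priors actually used; the structure of the calculation, Eq. (A.1), is what the sketch illustrates.

```python
import numpy as np

def estimate_f0(region, mstar=1.0, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of F and f0 (Eq. A.1) for one star.

    region : (P_min, P_max, m_min, m_max), periods in days, masses in Earth masses.
    NOTE: the priors below are illustrative placeholders, not those of Table B.1.
    """
    rng = np.random.default_rng(seed)
    P = np.exp(rng.uniform(np.log(1.0), np.log(730.0), n_samples))   # period [d]
    K = np.exp(rng.uniform(np.log(0.1), np.log(100.0), n_samples))   # semi-amplitude [m/s]
    e = rng.beta(0.87, 3.0, n_samples)                               # eccentricity
    # minimum mass from the standard RV relation (m << M*), in Earth masses
    msini = 317.8 * (K / 28.4329) * np.sqrt(1 - e**2) * (P / 365.25)**(1/3) * mstar**(2/3)
    Pmin, Pmax, mmin, mmax = region
    F = np.mean((P > Pmin) & (P < Pmax) & (msini > mmin) & (msini < mmax))
    # placeholder uniform prior on Np between 0 and 5
    Np = np.arange(0, 6)
    pi_Np = np.full(Np.size, 1.0 / Np.size)
    f0 = 1.0 - np.sum(pi_Np * (1.0 - F) ** Np)                       # Eq. (A.1)
    return F, f0
```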

In Fig. A.1, we sketch this calculation, showing several thousand random samples from the prior distribution (for Np = 1) in the plane of orbital period and planet minimum mass, together with the three regions R1, R2, and R3 defined in the main text. For our choice of priors, the fraction of samples that fall within the regions and the corresponding values of f0 are

  • F ≈ 0.142 and f0 ≈ 0.295 for R1

  • F ≈ 0.022 and f0 ≈ 0.053 for R2

  • F ≈ 0.029 and f0 ≈ 0.072 for R3

as obtained from 10,000,000 prior samples.

Fig. A.1. Procedure to estimate f0 using random samples from the prior distribution that fall inside each region of interest.

If the prior distribution is assigned directly to planet mass instead of RV semi-amplitude, there may be analytical expressions for all cumulative distributions, allowing for a calculation of f0 that does not rely on Monte Carlo.

Appendix B: Model parameters and priors

This appendix details the parameters and prior distributions used in the RV analysis. We parametrise each of the Np Keplerian functions with the orbital period P, the semi-amplitude K, the orbital eccentricity e, the argument of pericentre ω, and the mean anomaly at the epoch M0. The reference epoch is set to the time of the first observation. The systemic velocity is denoted as vsys and is fixed to zero for simplicity. A jitter parameter σJ is used to account for otherwise unmodelled white noise.

Each dataset consists of N observations containing the times ti, the radial velocity measurements viobs, and the associated uncertainties σi. We further define the model radial velocities, vi, as deterministic functions of the orbital parameters and vsys. The connections between all the parameters are shown in Fig. B.1 as a probabilistic graphical model.

Fig. B.1. Probabilistic graphical model for the RV analysis. An arrow between two nodes indicates the direction of conditional dependence. The circled nodes are parameters of the model, whose joint distribution is sampled. The filled node represents the observed RVs. The ti and σi nodes are assumed given and thus fixed, while the vi are deterministic functions of other nodes. The variables inside the boxes are repeated a given number of times. The dashed line connecting fR indicates the importance sampling scheme used to estimate the posterior for this parameter.

Prior distributions for each parameter are listed in Table B.1. Note that these same priors were used for sampling the orbital parameters of the simulated planets, while ensuring the values fell within each region of interest.

Table B.1. Prior distributions used in the RV analysis.

Appendix C: Posterior uncertainties

The usual method to estimate planet occurrence rates from RV surveys relies on binomial statistics. That is, it is assumed that the number of planets detected, k, follows a binomial distribution with parameters S and fR, using our notation. To account for the different sensitivity of each star, the average completeness within the region R is used to calculate an “effective” number of stars, Seff, that replaces S in the binomial probability.

Even if not explicitly stated, the inference for fR within this context is usually based on the beta-binomial distribution: a beta distribution (of which the uniform distribution is a special case) is often assumed as the prior for fR. The beta is a conjugate prior to the binomial, so the posterior distribution for fR is also a beta distribution, with parameters

$$ \alpha = k + 1 \quad \text{and} \quad \beta = S - k + 1, $$

where the +1 terms come from the uniform prior.

The standard deviation of the posterior for fR is maximised if there are ⌊S/2⌋ or ⌈S/2⌉ detections and minimised if there are zero or S detections; between these extremes, it depends on the true value of fR. This is shown in Fig. C.1, as a function of S. As we increase the number of stars in the sample, the posterior uncertainty for fR will stay within this range. If only Seff stars have enough sensitivity to detect a planet, the uncertainty in the posterior is larger. Figure C.1 shows a dashed curve for the case where Seff = S/2 (completeness of 50%).
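For concreteness, these bounds follow directly from the standard deviation of a Beta(k+1, S−k+1) distribution, which can be computed as in this short sketch:

```python
import numpy as np

def beta_posterior_std(k, S):
    """Standard deviation of the Beta(k + 1, S - k + 1) posterior for fR."""
    a, b = k + 1, S - k + 1
    return np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

S = 50
print(beta_posterior_std(S // 2, S))   # maximum uncertainty, ~0.069
print(beta_posterior_std(0, S))        # minimum uncertainty, ~0.019
```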

Fig. C.1. Standard deviation of the posterior distribution for fR as a function of the number of stars in the sample S.

Overplotting the standard deviations of the inferred posterior distributions for fR using our method (see values listed in Fig. 3) shows good agreement with the beta-binomial results. We see two sources of uncertainty being reflected in the posterior: from the number of stars included in the analysis and from the sensitivity of each star’s RV time series.

As expected, each of the three regions we studied is affected differently by these uncertainties. All RV datasets are sensitive to planets in R2, so the posterior uncertainty comes only from the limited number of stars in the sample. For R1, only a fraction of the stars are sensitive to planets with these orbital periods and masses, so the posterior uncertainty is larger, and comparable to a completeness fraction of 50%. Inference in region R3 is mostly controlled by the prior.

In summary, the uncertainty in our inferred occurrence rates is compatible with other methods, but our approach completely avoids having to determine k and the survey completeness.

Appendix D: Generalisations and diagnostics

In Section 2.3, the planet occurrence rate was defined as the fraction of stars with at least one planet within a region of parameter space. An alternative definition could be the average number of planets per star within that region (e.g. Zhu 2019). With some modifications, our method can be used to estimate this quantity.

We define the event “there are np planets with parameters in a given region”, denoted by np itself (not to be confused with Np), which takes non-negative integer values. For each star, we assume this event follows a Poisson distribution with rate parameter equal to the average number of planets per star within the region, here denoted as λR:

$$ p(\lambda_R; n_p) = \frac{ \lambda_R^{\,n_p} \, e^{-\lambda_R} }{ n_p! } \tag{D.1} $$

The priors used in the individual analyses again define a prior probability for np, but now it is not within the same distribution family. The original priors were defined for the total number of planets in the system, Np. The prior probability that np out of Np planets are within R is given by a binomial distribution. After marginalising out Np, we obtain:

$$ \pi(F; n_p) = \sum_{N_p=0}^{5} \pi_{N_p}(N_p) \cdot \binom{N_p}{n_p} \cdot F^{\,n_p} \cdot (1 - F)^{N_p - n_p}, \tag{D.2} $$

where F is the probability of one planet being within R, which was estimated using Monte Carlo (see Appendix A). The upper limit in the sum is set by the maximum value for Np in the prior.

The ratio p(λR; np)/π(F; np) can then be used to compute the likelihood, analogously to Eq. (7) but using np,jk as the number of planets in R for each posterior sample, and thus the posterior for λR can be evaluated after defining a prior p(λR). As before, a uniform distribution, from 0 to 5, may be a suitable choice.
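A possible implementation of this generalisation is sketched below, with function and argument names of our own choosing; it evaluates, for each star and each grid value of λR, the importance-weighted average of Eq. (D.1) over Eq. (D.2).

```python
import numpy as np
from scipy.special import comb, factorial

def lambda_posterior(np_samples, F, pi_Np, grid=None):
    """Posterior for lambda_R, the average number of planets per star in R.

    np_samples : one integer array per star with the number of posterior-sample
                 planets inside R for each posterior sample.
    F          : per-star prior probability of a single planet lying in R.
    pi_Np      : prior probabilities for Np = 0 ... Np_max.
    Assumes a uniform prior on lambda_R over the grid.
    """
    if grid is None:
        grid = np.linspace(0.0, 5.0, 501)
    Np = np.arange(len(pi_Np))
    log_like = np.zeros_like(grid)
    for j, npj in enumerate(np_samples):
        npj = np.asarray(npj)
        # original prior pi(F; np) for each posterior sample (Eq. D.2)
        prior_np = np.array([
            np.sum(pi_Np * comb(Np, n) * F[j] ** n
                   * (1.0 - F[j]) ** np.clip(Np - n, 0, None))
            for n in npj
        ])
        # Poisson probability p(lambda_R; np) for each grid value (Eq. D.1)
        pois = np.exp(-grid[:, None]) * grid[:, None] ** npj / factorial(npj)
        with np.errstate(divide='ignore'):
            log_like += np.log(np.mean(pois / prior_np, axis=1))
    post = np.exp(log_like - log_like.max())
    post /= np.trapz(post, grid)
    return grid, post
```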

Some caveats are worth mentioning. The Poisson assumption (D.1) implies that different planets around the same star occur independently, which is likely not true. Moreover, the original prior π(F; np) and the new prior p(λR; np) may be very different for some values of λR, leading to high variance in the importance sampling approximation (Eq. 6).

Fig. D.1. Ratio of effective sample size (ESS) in the importance sampling approximation to the total number of posterior samples vs the fraction of stars with planets fR (top) and the average number of planets per star λR (bottom), for each star and each box. The thick black curves show the mean ratio across all stars.

Fig. D.2. Results from bootstrap resampling of the posterior samples for the three regions R1, R2, R3, and from the analysis of all 50 simulated datasets. Top: Posterior distributions for the fraction of stars with planets. Bottom: Posteriors for the average number of planets per star. The lines are the same as in Fig. 3. Note that the true values of fR and λR are the same due to the way the simulated datasets were generated.

A useful diagnostic when using importance sampling is the effective sample size (ESS), which can be computed from the ratio of the two priors, usually called the importance weight (e.g. Elvira et al. 2022). The ESS essentially quantifies how many of the posterior samples, obtained based on π(f0; 𝒫) or π(F; np), have substantial weight under the new priors p(fR; 𝒫) or p(λR; np). We estimate the ESS for each simulated star as

$$ \mathrm{ESS}_j = \frac{ \left( \sum_{k=1}^{M_j} w_{jk} \right)^2 }{ \sum_{k=1}^{M_j} w_{jk}^2 }, \tag{D.3} $$

with the weights defined as

$$ w_{jk} = \frac{ p(f_R; \mathcal{P}_{jk}) }{ \pi(\mathcal{P}_{jk}) } \quad \text{or} \quad w_{jk} = \frac{ p(\lambda_R; n_{p,jk}) }{ \pi(n_{p,jk}) }. \tag{D.4} $$
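In code, the ESS and the weights of Eq. (D.4) for the fR case can be computed as in the following sketch (variable names are ours): the prior on each 𝒫jk is Bernoulli with probability f0, and the target is Bernoulli with probability fR.

```python
import numpy as np

def effective_sample_size(weights):
    """Effective sample size of an importance-sampling approximation (Eq. D.3)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def weights_fr(P_samples, f_R, f0):
    """Importance weights of Eq. (D.4) for the fR case.

    P_samples : 0/1 array flagging whether each posterior sample lies in R.
    """
    P = np.asarray(P_samples, dtype=float)
    return (f_R * P + (1 - f_R) * (1 - P)) / (f0 * P + (1 - f0) * (1 - P))
```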

Figure D.1 shows the ratio of the ESS to the total number of posterior samples for each star and region, as a function of fR and λR. In the case of fR, the ESS is above 50% of the number of posterior samples, except for values close to 1 for R1 and above 0.25 for R3. Therefore, the importance sampling approximation is less accurate in these cases. For R1, this has a small impact on the results since the posterior for fR peaks at lower values. The ESS is in general smaller for λR, especially for R3, due to the larger mismatch between the original and new priors.

Another way to diagnose the results is to perform bootstrap resampling, by randomly shuffling (with replacement) the posterior samples for each star and recomputing the occurrence rate posteriors. An example of this is shown in Fig. D.2, which plots the posteriors for fR and λR for each region, obtained from 30 resampling iterations. The posteriors show little variation for R1 and R2, but the variance and bias are higher for R3.
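The resampling itself is straightforward; a possible sketch is shown below, reusing the occurrence_rate_posterior function sketched in Sect. 2.3 (the seed, names, and number of iterations are ours).

```python
import numpy as np

rng = np.random.default_rng(1)   # hypothetical seed

def bootstrap_fr(in_region, f0, n_boot=30):
    """Bootstrap the fR posterior by resampling each star's posterior samples.

    in_region : list of boolean arrays (one per star) flagging which posterior
                samples fall inside the region R.
    Reuses occurrence_rate_posterior() from the sketch in Sect. 2.3.
    """
    posteriors = []
    for _ in range(n_boot):
        p_region = [rng.choice(s, size=s.size, replace=True).mean()
                    for s in in_region]
        grid, post = occurrence_rate_posterior(p_region, f0)
        posteriors.append(post)
    return grid, np.array(posteriors)
```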

Finally, we note that a perhaps more natural generalisation of the method would be to estimate the individual probabilities for np = 0, np = 1, and so on, while maybe collapsing all values above a certain number into one bin (e.g. np ≥ 2). These individual probabilities could be modelled with a Dirichlet distribution, allowing for a larger flexibility than the Poisson, with few additional parameters. Such a parametrisation might also be more easily compared to empirical probabilities provided by planet formation models.
