<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML">
<link rel="stylesheet" type="text/css" href="../../../CSS_FULL/edps_full.css">
<body><div id="contenu_olm">

<!-- DOI: 10.1051/0004-6361/200810097 -->

<h2 class="sec">Online Material</h2>

<p>

<h2 class="sec"><a name="SECTION000100000000000000000"></a>
Appendix A: Statistical evaluation of the model
</h2>

<p>

<h3 class="sec2"><a name="SECTION000101000000000000000"></a>
A.1. Univariate tests on individual planet characteristics
</h3>

<p>
<A NAME="table:mean_sigma"></A><p class="inset-old"><a href="/articles/aa/full_html/2009/35/aa10097-08/table6.html"><span class="bold">Table 6:</span></a>&#160;&#160;
 Mean values and standard deviations of the system parameters for the observed
transiting planets and our simulated detections.</p>
<p>
In this section, we detail the statistical method and tests
  that have been used to validate the model. We first perform basic
tests of our model with simulations that repeat the number of observations of the
OGLE survey multiple times in order to obtain 50&nbsp;000 detections.  This number was chosen as a compromise between
  statistical significance and computation time.
Table&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#table:mean_sigma">6</a> compares the mean values and standard
deviations in the observations and in the simulations. The closeness
of the values obtained for the two populations is an indication that
our approach provides a reasonably good fit to the real stellar and
planetary populations, and to the real planet compositions and
evolution.  

<p>
However, we do require more advanced statistical tests. First, we
use the so-called Student's t-test to formally compare the mean
values of all characteristics for both types of planets. The
intuition is that, should the model yield simulated planets of
attributes similar to real planets, the average values of these
attributes should not be significantly different from one another.
In other words, the so-called null hypothesis <I>H</I><SUB>0</SUB> is that the
difference of their mean is zero. Posing <I>H</I><SUB>0</SUB>: 
<!-- MATH: $\mu^{r}-\mu^{s}=0$ -->
<IMG
 WIDTH="67" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img101.png"
 ALT="$\mu^{r}-\mu^{s}=0$">, where superscripts <I>r</I> and <I>s</I> denote real and simulated planets
respectively, and taking as the alternative hypothesis its
complement <IMG
 WIDTH="18" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img102.png"
 ALT="$H_{\rm a}$">:

<!-- MATH: $\mu^{r}-\mu^{s}\neq0$ -->
<IMG
 WIDTH="67" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img103.png"
 ALT="$\mu^{r}-\mu^{s}\neq0$">,
we compute the <I>t</I> statistic using the first and second moments of the distribution of
each planet characteristic as follows:
<br><p></p>
<DIV ALIGN="CENTER">

<!-- MATH: \begin{equation}
t = \frac{{\left( {\mu_x^{r}  - \mu_x^{s} } \right)}}{{\frac{{s_{\rm p}
}}{{\sqrt {n_{r}  + n_{s}} }}}},
\end{equation} -->

<TABLE WIDTH="100%" ALIGN="CENTER">
<TR VALIGN="MIDDLE"><TD ALIGN="CENTER" NOWRAP><IMG
 WIDTH="88" HEIGHT="79"
 SRC="img104.png"
 ALT="\begin{displaymath}t = \frac{{\left( {\mu_x^{r} - \mu_x^{s} } \right)}}{{\frac{{s_{\rm p}
}}{{\sqrt {n_{r} + n_{s}} }}}},
\end{displaymath}"></td>
<TD WIDTH=10 ALIGN="RIGHT">
(2)</td></tr>
</TABLE>
</DIV><BR CLEAR="ALL"><p></p>
where <I>x</I> is each of the planet characteristics, <I>n</I> is
the size of each sample, and <IMG
 WIDTH="15" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img105.png"
 ALT="$s_{\rm p}$">
is the square root of the pooled
variance accounting for the sizes of the two population
samples<A NAME="tex2html24"
 HREF="#foot2894"><sup><IMG  ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="/icons/foot_motif.png"></sup></A>.
The statistic follows a <I>t</I> distribution, from which one can easily derive the two-tailed
critical probability  that the two samples come from one unique
population of planets, i.e. that <I>H</I><SUB>0</SUB> cannot be rejected.  
The results are displayed in Table&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#table:agreement2">7</a> (note that
<IMG
 WIDTH="10" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img3.png"
 ALT="$\theta $">
is the Safronov number; other parameters have their usual
meaning). In all cases, the probabilities are greater than
40%, implying that there is no significant difference in the mean
characteristics of both types of planets. In other words, the two
samples exhibit similar central tendencies.
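<p>
The comparison of means described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the textbook pooled-variance form, in which the standard error is written as the pooled standard deviation times the square root of (1/<I>n</I><SUB><I>r</I></SUB> + 1/<I>n</I><SUB><I>s</I></SUB>), a normalisation that is not taken from Eq.&nbsp;(2). The function name <code>pooled_t_statistic</code> and the toy samples are hypothetical.
</p>
<pre>
```python
import math


def pooled_t_statistic(real, simulated):
    """Return (t, degrees_of_freedom) for a pooled-variance two-sample test."""
    n_r, n_s = len(real), len(simulated)
    mean_r = sum(real) / n_r
    mean_s = sum(simulated) / n_s
    # Unbiased sample variance of each sample.
    var_r = sum((v - mean_r) ** 2 for v in real) / (n_r - 1)
    var_s = sum((v - mean_s) ** 2 for v in simulated) / (n_s - 1)
    dof = n_r + n_s - 2
    # Pooled variance: both variances weighted by their degrees of freedom.
    pooled_var = ((n_r - 1) * var_r + (n_s - 1) * var_s) / dof
    # Textbook standard error of the difference of means (an assumption here).
    std_err = math.sqrt(pooled_var) * math.sqrt(1.0 / n_r + 1.0 / n_s)
    return (mean_r - mean_s) / std_err, dof


# Hypothetical toy samples, standing in for one planet characteristic.
real_values = [1.0, 2.0, 3.0, 4.0, 5.0]
simulated_values = [2.0, 3.0, 4.0, 5.0, 6.0]
t, dof = pooled_t_statistic(real_values, simulated_values)
print(t, dof)
```
</pre>
<p>
A value of <I>t</I> close to zero, relative to the spread of the <I>t</I> distribution with the returned number of degrees of freedom, corresponds to a high two-tailed probability and hence to no evidence against <I>H</I><SUB>0</SUB>.
</p>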

<p>
Next, we perform the Kolmogorov-Smirnov test to allow for a more
global assessment of the compatibility of the two populations. This
test has the advantage of being non-parametric, making no assumption
about the distribution of data. This is particularly important since
the number of real planets remains small, which may alter the
normality of the distribution. Moreover, the Kolmogorov-Smirnov
comparison tests the stochastic dominance of the entire
distribution of real planets over simulated planets. To do so, it
computes the largest absolute deviation <I>D</I> between <I>F</I><SUB><I>r</I></SUB>(<I>x</I>), the
empirical cumulative distribution function of characteristic <I>x</I> for real planets, and <I>F</I><SUB><I>s</I></SUB>(<I>x</I>), the cumulative distribution
function of characteristic <I>x</I> for simulated planets, over the
range of values of <I>x</I>: 
<!-- MATH: $D = \mathop {\max }\limits_x \left\{
{\left| {F_{\rm real} \left( x \right) - F_{\rm sim} \left( x \right)}
\right|} \right\}$ -->
<IMG
 WIDTH="173" HEIGHT="27" ALIGN="MIDDLE" BORDER="0"
 SRC="img107.png"
 ALT="$D = \mathop {\max }\limits_x \left\{
{\left\vert {F_{\rm real} \left( x \right) - F_{\rm sim} \left( x \right)}
\right\vert} \right\}$">.
If the calculated <I>D</I>-statistic is greater than
the critical <I>D</I><sup>*</sup>-statistic (provided by the Kolmogorov-Smirnov table: for 31
observations <I>D</I><sup>*</sup>=0.19 for an 80% confidence level and <I>D</I><sup>*</sup>=0.24 for a
95% confidence level), then one
must reject the null hypothesis that the two distributions are
similar, 
<!-- MATH: $H_0: | F_{r}(x)-F_{s}(x) | <D^*$ -->
<I>H</I><SUB>0</SUB>: | <I>F</I><SUB><I>r</I></SUB>(<I>x</I>)-<I>F</I><SUB><I>s</I></SUB>(<I>x</I>) | &lt;<I>D</I><sup>*</sup>, and accept 
<!-- MATH: $H_{\rm a}:
| F_{r}(x)-F_{s}(x) | \geq D^*$ -->
<IMG
 WIDTH="144" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img108.png"
 ALT="$H_{\rm a}:
\vert F_{r}(x)-F_{s}(x) \vert \geq D^*$">.
Table&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#table:agreement">8</a> shows the
result of the test. The first column provides the <I>D</I>-statistic, and
the second column gives the probability that the two samples have
the same distribution.
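<p>
The <I>D</I>-statistic defined above can be computed directly from the two samples. The following is a minimal sketch, not the code used for Table&nbsp;8: the function names and the toy data are hypothetical, and the critical value passed to <code>rejects_null</code> is simply the 95% value quoted in the text for 31 observations, reused here for illustration only.
</p>
<pre>
```python
from bisect import bisect_right


def ks_statistic(real, simulated):
    """Largest absolute gap D between the two empirical CDFs."""
    real_sorted = sorted(real)
    sim_sorted = sorted(simulated)
    n_r, n_s = len(real_sorted), len(sim_sorted)
    d_max = 0.0
    # Both empirical CDFs are step functions that only change at data points,
    # so it is enough to compare them at every observed value.
    for x in real_sorted + sim_sorted:
        f_r = bisect_right(real_sorted, x) / n_r  # fraction of real values not above x
        f_s = bisect_right(sim_sorted, x) / n_s   # same for the simulated values
        d_max = max(d_max, abs(f_r - f_s))
    return d_max


def rejects_null(d, d_critical):
    """True when D exceeds the tabulated critical value D*."""
    return d > d_critical


# Hypothetical toy samples.
d = ks_statistic([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0])
print(d, rejects_null(d, 0.24))
```
</pre>
<p>
Because the comparison uses the whole cumulative distribution, two samples with equal means but different shapes can still yield a large <I>D</I>.
</p>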

<p>
Again, we  find a good match between the model and observed samples:
the parameters that have the least satisfactory fits are the
planet's equilibrium temperature and the planet mass.  
These values are interpreted as being due to
imperfections in the assumed star and planet populations. It is
important to stress that although the extrasolar planets' main characteristics
(period, mass) are well-defined by radial-velocity surveys, the
subset of transiting planets is highly biased towards short periods
and corresponds to a relatively small sample of the known
radial-velocity planet population. This explains why the probability
that the planetary mass is drawn from the same distribution in the
model and in the observations is  relatively low, which may otherwise seem
surprising given that the planet mass distribution would be expected
to be relatively well defined by the radial-velocity measurements.

<p>

<h3 class="sec2"><a name="SECTION000102000000000000000"></a><A NAME="sec:2D"></A>
A.2. Tests in two dimensions
</h3>

<p>
Tests of the agreement between observations and models in two
dimensions, i.e. when considering one parameter against
another, can be performed using the method of maximum likelihood
as described in Paper&nbsp;I. Table&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#table:likelihood">9</a> provides
values of the standard deviations from maximum likelihood for
important combinations of parameters.  The second column is a
comparison using all planets discovered by transit surveys, and the third
 column using all known transiting planets (including those discovered by radial
 velocity).

<p>
The results are generally good, with deviations not exceeding

<!-- MATH: $1.82\sigma$ -->
<IMG
 WIDTH="38" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img109.png"
 ALT="$1.82\sigma$">.
They are also very similar when considering all
planets or only the subset discovered by photometric surveys. This
shows that the radial-velocity and photometric planet characteristics
are quite similar. The mass vs. radius relation shows the highest deviation,
as a few planets are outliers of our planetary evolution model.

<p>

<h3 class="sec2"><a name="SECTION000103000000000000000"></a>
A.3. Multivariate assessment of the performance of&nbsp;the&nbsp;model
</h3>

<p>

<h4 class="sec3"><a name="SECTION000103100000000000000"></a>
A.3.1. Principle
</h4>

<p>
Tests such as Student's <I>t</I>-test and the Kolmogorov-Smirnov test
are important to determine the adequacy of given parameters, but
they do not provide a multivariate assessment of the model. In order
to globally assess the viability of our model, we proceed as follows.
We generate a list including  50&nbsp;000 ``simulated'' planets and
the 31  ``observed''  giant planets from
Table&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#table:transiting_planets">1</a>.
  This number of simulated planets is necessary for an accurate multivariate analysis (see Sect.&nbsp;A.3.2).
A dummy variable <I>Y</I> is generated with value 
  1 if the planet is observed, 0 if the planet is simulated.

<p>
In order to test dependencies between parameters, we have presented in
Table&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#table:all_correlations">3</a> (Sect.&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#sec:stat">2.4</a>) the Pearson
correlation coefficients between each pair of variables, including <I>Y</I>.  A first
look at the table shows that the method correctly retrieves the
important physical correlations without any a priori information
concerning the links that exist between the different parameters.  For
example, the stellar effective temperature 
<!-- MATH: $T_{\rm eff}$ -->
<IMG
 WIDTH="23" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img17.png"
 ALT="$T_{\rm eff}$">
is positively
correlated to the stellar mass <IMG
 WIDTH="22" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img12.png"
 ALT="$M_{\star}$">,
and radius <IMG
 WIDTH="18" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img36.png"
 ALT="$R_{\star}$">.
It is
also naturally positively correlated to the planet's equilibrium
temperature 
<!-- MATH: $T_{\rm eq}$ -->
<IMG
 WIDTH="22" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img37.png"
 ALT="$T_{\rm eq}$">,
and to the planet's radius <IMG
 WIDTH="18" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img7.png"
 ALT="$R_{\rm p}$">, simply because evolution models predict planetary radii that are
larger for larger values of the irradiation, all other parameters being
equal. Interestingly, it can be seen that although the Safronov number
is by definition correlated to the planetary mass, radius, orbital
period and star mass (see Eq.&nbsp;(<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#eq:safronov">1</a>)), the largest
correlation coefficients for <IMG
 WIDTH="10" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img3.png"
 ALT="$\theta $">
in absolute value are those related to <IMG
 WIDTH="22" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img11.png"
 ALT="$M_{\rm p}$"> and <I>P</I> (as both these parameters vary by more
  than one decade, while <IMG
 WIDTH="22" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img12.png"
 ALT="$M_{\star}$">
and <IMG
 WIDTH="18" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img7.png"
 ALT="$R_{\rm p}$">
only vary by a factor of
  2). Also, we observe that the star metallicity is only correlated
to the planet radius. This is a consequence of our assumption that a
planet's heavy element content is directly proportional to the star's
[Fe/H], and of the fact that planets with more heavy elements are
smaller, all other parameters being equal. The planet's radius is
itself correlated negatively with [Fe/H] and positively with 
<!-- MATH: $T_{\rm
eq}$ -->
<IMG
 WIDTH="22" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img37.png"
 ALT="$T_{\rm eq}$">,
<IMG
 WIDTH="22" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img12.png"
 ALT="$M_{\star}$">,<IMG
 WIDTH="18" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img36.png"
 ALT="$R_{\star}$">
and 
<!-- MATH: $T_{\rm eff}$ -->
<IMG
 WIDTH="23" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img17.png"
 ALT="$T_{\rm eff}$">.
Table&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#table:all_correlations">3</a> also shows the correlations with
the ``reality'' parameter. Of course, a satisfactory model is one in
which there is no correlation between this reality parameter and other
physical parameters of the model. In our case, the corresponding
correlation coefficients are always small and indicate a good match
between the two populations.
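<p>
The role of the ``reality'' parameter can be illustrated with a short sketch: a Pearson coefficient is computed between one characteristic and the dummy variable <I>Y</I>. The catalogue values below are hypothetical and were chosen so that real and simulated planets share the same mean, which is the situation expected for a satisfactory model.
</p>
<pre>
```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy catalogue: one characteristic plus the dummy variable Y
# (1 = observed planet, 0 = simulated planet).
radius = [1.0, 1.2, 0.9, 1.3]
is_real = [1, 1, 0, 0]
r_with_reality = pearson(radius, is_real)
print(r_with_reality)
```
</pre>
<p>
A coefficient near zero means that knowing the characteristic gives no linear indication of whether a planet is observed or simulated; it says nothing about the other characteristics, which is why a multivariate test is still needed.
</p>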

<p>
<A NAME="table:agreement2"></A><p class="inset-old"><a href="/articles/aa/full_html/2009/35/aa10097-08/table7.html"><span class="bold">Table 7:</span></a>&#160;&#160;
Test of equality of means. Student's <I>t</I> value and critical
probabilities <I>p</I> that individual parameters for both real and simulated
planets have the same sample mean.</p>
<p>
<A NAME="table:agreement"></A><p class="inset-old"><a href="/articles/aa/full_html/2009/35/aa10097-08/table8.html"><span class="bold">Table 8:</span></a>&#160;&#160;
Kolmogorov-Smirnov tests. <I>D</I>-statistics and critical
probabilities that individual parameters for both real and simulated
planets have the same distribution.</p>
<p>
<A NAME="table:likelihood"></A><p class="inset-old"><a href="/articles/aa/full_html/2009/35/aa10097-08/table9.html"><span class="bold">Table 9:</span></a>&#160;&#160;
Standard deviations from maximum likelihood of the
model and  observed transiting planet populations.</p>
<p>
Obviously the unconditional probability that a given planet is real
is 
<!-- MATH: $\Pr(Y=1)=31/50~031\simeq.00062$ -->
<IMG
 WIDTH="187" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img110.png"
 ALT="$\Pr(Y=1)=31/50~031\simeq.00062$">.
Now we wish to know whether this
probability is sensitive to any of the planet characteristics,
controlling for all planet characteristics at once. Hence we model
the probability that a given planet is  ``real'' using the logistic
cumulative distribution function as follows:
<br><p></p>
<DIV ALIGN="CENTER">

<!-- MATH: \begin{equation}
\Pr(Y = 1|{\vec{X}_i}) = \frac{{{\rm e}^{{\vec{X}_i\vec{b} }}
}}{{1 + {\rm e}^{{\vec{X}_i\vec{b} }} }}
\end{equation} -->

<TABLE WIDTH="100%" ALIGN="CENTER">
<TR VALIGN="MIDDLE"><TD ALIGN="CENTER" NOWRAP><A NAME="eq:logistic"></A><IMG
 WIDTH="149" HEIGHT="70"
 SRC="img111.png"
 ALT="\begin{displaymath}
\Pr(Y = 1\vert{\vec{X}_i}) = \frac{{{\rm e}^{{\vec{X}_i\vec{b} }}
}}{{1 + {\rm e}^{{\vec{X}_i\vec{b} }} }}
\end{displaymath}"></td>
<TD WIDTH=10 ALIGN="RIGHT">
(4)</td></tr>
</TABLE>
</DIV><BR CLEAR="ALL"><p></p>
where  <IMG
 WIDTH="17" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img112.png"
 ALT="$\vec{X}_i$">
is the vector of explanatory
variables (i.e. planet characteristics) for the planet <I>i</I> (real or
simulated), and <IMG
 WIDTH="11" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img113.png"
 ALT="${\vec b}$">
is the vector of parameters to be
estimated, and 
<!-- MATH: $\vec{X}_i\vec{b}\equiv b_0 + \sum_j X_{ij}b_j$ -->
<IMG
 WIDTH="115" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img114.png"
 ALT="$\vec{X}_i\vec{b}\equiv b_0 + \sum_j X_{ij}b_j$">,
and <I>b</I><SUB>0</SUB> is a
constant. There are <I>n</I> events to be considered (<I>i</I>=1..<I>n</I>) and <I>m</I> explanatory variables (<I>j</I>=1..<I>m</I>).  

<p>
Importantly, an ordinary least squares estimator should not be used in
this framework, due to the binary nature of the dependent
variable. Departures from normality and predictions outside the
range [0, 1] are the main reasons. Instead, Eq.&nbsp;(<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#eq:logistic">4</a>) can be estimated using maximum likelihood
methods. The so-called logit specification (<A NAME="aaref21"></A><a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#Greene_2000">Greene  2000</a>) fits
the parameter estimates <IMG
 WIDTH="11" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img113.png"
 ALT="${\vec b}$">
so as to maximize the log
likelihood function:
<br><p></p>
<DIV ALIGN="CENTER">

<!-- MATH: \begin{equation}
\log L({\vec{Y}}|{\vec{X,{b}}}) = \sum\limits_{i = 1}^n
{y_i\ {{\vec{X}_i\vec{b} }}}  - \sum\limits_{i = 1}^n {\log
\left[ {1 + {\rm e}^{{\vec{X}_i\vec{b} }} } \right]}.
\end{equation} -->

<TABLE WIDTH="100%" ALIGN="CENTER">
<TR VALIGN="MIDDLE"><TD ALIGN="CENTER" NOWRAP><IMG
 WIDTH="286" HEIGHT="79"
 SRC="img115.png"
 ALT="\begin{displaymath}\log L({\vec{Y}}\vert{\vec{X,{b}}}) = \sum\limits_{i = 1}^n
{...
...^n {\log
\left[ {1 + {\rm e}^{{\vec{X}_i\vec{b} }} } \right]}.
\end{displaymath}"></td>
<TD WIDTH=10 ALIGN="RIGHT">
(5)</td></tr>
</TABLE>
</DIV><BR CLEAR="ALL"><p></p>
The <IMG
 WIDTH="31" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img116.png"
 ALT="$\log L$">
function is then maximized
choosing 
<!-- MATH: $\vec{\hat{b} }$ -->
<IMG
 WIDTH="11" HEIGHT="33" ALIGN="MIDDLE" BORDER="0"
 SRC="img117.png"
 ALT="$\vec{\hat{b} }$">
such that  
<!-- MATH: ${\partial \log
L(y_i,{\vec{X}_i,\vec{\hat{b}}})}/{\partial\vec{\hat{b} }}=0$ -->
<IMG
 WIDTH="138" HEIGHT="33" ALIGN="MIDDLE" BORDER="0"
 SRC="img118.png"
 ALT="${\partial \log
L(y_i,{\vec{X}_i,\vec{\hat{b}}})}/{\partial\vec{\hat{b} }}=0$">,
using a
Newton-Raphson algorithm. The closer the coefficients

<!-- MATH: $\hat{b}_1,\hat{b}_2,..,\hat{b}_m$ -->
<IMG
 WIDTH="68" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
 SRC="img119.png"
 ALT="$\hat{b}_1,\hat{b}_2,..,\hat{b}_m$">
are to 0, the closer the model is
to the observations. Conversely, a coefficient that is significantly
different from zero tells us that there is a correlation between the
corresponding variable and the probability of a planet being ``real'', i.e. the
model is not a good match to the observations. 
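<p>
The maximisation described above can be sketched for the simplest case of a constant plus a single explanatory variable. This is a hedged illustration rather than the procedure used for the tables: the analysis in this appendix uses 9 variables and would need a general linear solver, whereas the hand-written 2x2 solve, the function names and the toy data below are assumptions made for brevity.
</p>
<pre>
```python
import math


def log_likelihood(b0, b1, xs, ys):
    """Eq. (5) for one explanatory variable: sum of y*(Xb) - log(1 + e^(Xb))."""
    total = 0.0
    for x, y in zip(xs, ys):
        xb = b0 + b1 * x
        total += y * xb - math.log(1.0 + math.exp(xb))
    return total


def fit_logit(xs, ys, max_iter=50, tol=1e-12):
    """Newton-Raphson maximisation of the log likelihood for (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0    # gradient of log L with respect to (b0, b1)
        a = c = d = 0.0  # entries of minus the Hessian, [[a, c], [c, d]]
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # Eq. (4)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            a += w
            c += w * x
            d += w * x * x
        det = a * d - c * c
        # Newton step: solve (minus Hessian) * step = gradient for the 2x2 system.
        step0 = (d * g0 - c * g1) / det
        step1 = (a * g1 - c * g0) / det
        b0 += step0
        b1 += step1
        if tol > abs(step0) + abs(step1):
            break
    return b0, b1


# Hypothetical toy data: Y = 1 is more frequent when x = 1 than when x = 0.
xs = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0_hat, b1_hat = fit_logit(xs, ys)
print(b0_hat, b1_hat, log_likelihood(b0_hat, b1_hat, xs, ys))
```
</pre>
<p>
In this toy case the fitted slope is clearly non-zero because the fraction of ``real'' entries differs between the two values of <I>x</I>; for a well-behaved model the slopes would instead be compatible with zero.
</p>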

<p>
Two features of logistic regression using
maximum likelihood estimators are important. First, the
value added by the exercise is that the multivariate approach allows
us to hold all other planet characteristics constant, extending the
bivariate correlations to the multivariate case. In other words, we
control for all planet characteristics at once. Second, one can test
whether a given parameter estimate is equal to 0 with the usual null
hypothesis <I>H</I><SUB>0</SUB>: <I>b</I>=0 versus <IMG
 WIDTH="18" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img102.png"
 ALT="$H_{\rm a}$">:
<IMG
 WIDTH="33" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img120.png"
 ALT="${b}\neq 0$">.
The variance of
the estimator<A NAME="tex2html28"
 HREF="#foot2882"><sup><IMG  ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="/icons/foot_motif.png"></sup></A> is used
to derive the standard error of the parameter estimate. 
Using Eq.&nbsp;(<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#likelihood">6</a>), dividing each estimate

<!-- MATH: $\hat{{b}}_j$ -->
<IMG
 WIDTH="15" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
 SRC="img122.png"
 ALT="$\hat{{b}}_j$">
by the standard error 
<!-- MATH: ${\rm s.e.}(\hat{{b}}_j)$ -->
<IMG
 WIDTH="44" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
 SRC="img123.png"
 ALT="${\rm s.e.}(\hat{{b}}_j)$">
yields
the <I>t</I>-statistic and allows us to test <I>H</I><SUB>0</SUB>. We denote by 
<!-- MATH: ${\cal P}_j$ -->
<IMG
 WIDTH="17" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img124.png"
 ALT="${\cal P}_j$">
the probability that a higher value of <I>t</I> would occur by chance. This probability is evaluated for each explanatory variable
  <I>j</I>.  Should our model perform well, we would expect the <I>t</I> value
of each parameter estimate to be null, and the corresponding
probability 
<!-- MATH: ${\cal P}_j$ -->
<IMG
 WIDTH="17" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img124.png"
 ALT="${\cal P}_j$">
to be close to one. This would imply no
significant association between any single planet characteristic and
the event of being a ``real'' planet.
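<p>
The test on an individual coefficient can be sketched as follows. The example assumes that an estimate and its standard error are already available, and it replaces the <I>t</I> distribution by its normal limit, which is a reasonable approximation only for large samples such as the one considered here. The function name and the numerical inputs are hypothetical.
</p>
<pre>
```python
import math


def wald_test(estimate, std_error):
    """Return (t, P): the ratio b/s.e.(b) and its two-tailed probability.

    The probability uses the normal limit of the t distribution, an
    approximation that is reasonable only for large samples.
    """
    t = estimate / std_error
    return t, math.erfc(abs(t) / math.sqrt(2.0))


# Hypothetical estimates: one compatible with zero, one about two
# standard errors away from zero.
print(wald_test(0.0, 0.5))
print(wald_test(0.98, 0.5))
```
</pre>
<p>
The first call returns a probability of one (no evidence against <I>H</I><SUB>0</SUB>), while the second returns a probability close to the conventional 5% threshold.
</p>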

<p>
The global probability that the model and
observations are compatible can be estimated. To do so, we compute 
the log likelihood obtained when <I>b</I><SUB><I>j</I></SUB>=0 for <I>j</I>=1..<I>m</I>, where <I>m</I> is the number of variables. Following
Eq.&nbsp;(<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#likelihood">6</a>):
<br><p></p>
<DIV ALIGN="CENTER">

<!-- MATH: \begin{equation}
\log L({\vec{Y}}|1,b_0) = \sum\limits_{i = 1}^n
{y_i b_0}  - \sum\limits_{i = 1}^n {\log
\left[ {1 + {\rm e}^{b_0}} \right]}.
\end{equation} -->

<TABLE WIDTH="100%" ALIGN="CENTER">
<TR VALIGN="MIDDLE"><TD ALIGN="CENTER" NOWRAP><A NAME="likelihood"></A><IMG
 WIDTH="267" HEIGHT="79"
 SRC="img125.png"
 ALT="\begin{displaymath}
\log L({\vec{Y}}\vert 1,b_0) = \sum\limits_{i = 1}^n
{y_i b_...
...um\limits_{i = 1}^n {\log
\left[ {1 + {\rm e}^{b_0}} \right]}.
\end{displaymath}"></td>
<TD WIDTH=10 ALIGN="RIGHT">
(6)</td></tr>
</TABLE>
</DIV><BR CLEAR="ALL"><p></p>
The maximum of this quantity is 
<!-- MATH: $\log
L_0=n_0\log(n_0/n)+n_1\log(n_{1}/n)$ -->
<IMG
 WIDTH="204" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img126.png"
 ALT="$\log
L_0=n_0\log(n_0/n)+n_1\log(n_{1}/n)$">,
where <I>n</I><SUB>0</SUB> is the number of
cases in which <I>y</I>=0 and <I>n</I><SUB>1</SUB> is the number of observations with
<I>y</I>=1. <I>L</I><SUB>0</SUB> is thus the maximum likelihood obtained for a model which
is in perfect agreement with the observations (no explanatory variable
is correlated to the probability of being real).
Now, it can be shown that the likelihood ratio statistic
<br><p></p>
<DIV ALIGN="CENTER">

<!-- MATH: \begin{equation}
c_{\rm LL}=2 (\log L_1 - \log L_0)
\end{equation} -->

<TABLE WIDTH="100%" ALIGN="CENTER">
<TR VALIGN="MIDDLE"><TD ALIGN="CENTER" NOWRAP><IMG
 WIDTH="150" HEIGHT="49"
 SRC="img127.png"
 ALT="\begin{displaymath}c_{\rm LL}=2 (\log L_1 - \log L_0)
\end{displaymath}"></td>
<TD WIDTH=10 ALIGN="RIGHT">
(7)</td></tr>
</TABLE>
</DIV><BR CLEAR="ALL"><p></p>
follows a <IMG
 WIDTH="17" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
 SRC="img5.png"
 ALT="$\chi ^2$">
distribution for a number of degrees of freedom <I>m</I> when the null hypothesis is true (<A NAME="aaref1"></A><a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#Aldrich_1984">Aldrich &amp; Nelson 1984</a>). The
probability that a sum of the squares of <I>m</I> normally distributed random variables
with mean 0 and variance 1 is larger than a value&nbsp;
<!-- MATH: $c_{\rm LL}$ -->
<IMG
 WIDTH="21" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img128.png"
 ALT="$c_{\rm LL}$">&nbsp;is:
<br><p></p>
<DIV ALIGN="CENTER">

<!-- MATH: \begin{equation}
{\cal P}_{\chi^2}= P(m/2,c_{\rm LL}/2),
\end{equation} -->

<TABLE WIDTH="100%" ALIGN="CENTER">
<TR VALIGN="MIDDLE"><TD ALIGN="CENTER" NOWRAP><IMG
 WIDTH="133" HEIGHT="51"
 SRC="img129.png"
 ALT="\begin{displaymath}{\cal P}_{\chi^2}= P(m/2,c_{\rm LL}/2),
\end{displaymath}"></td>
<TD WIDTH=10 ALIGN="RIGHT">
(8)</td></tr>
</TABLE>
</DIV><BR CLEAR="ALL"><p></p>
where <I>P</I>(<I>k</I>,<I>z</I>) is the regularized incomplete Gamma function (e.g. <A NAME="aaref0"></A><a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#Abramowitz_1964">Abramowitz &amp; Stegun 1964</a>). 

<!-- MATH: ${\cal{P}}_{\chi^2}$ -->
<IMG
 WIDTH="23" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img6.png"
 ALT="${\cal {P}}_{\chi ^2}$">
is thus the probability that the model
planets and the observed planets are drawn from the same
distribution. 
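<p>
The global test can be sketched numerically. The example below is an illustration under stated assumptions: it evaluates the upper-tail probability of the chi-square distribution as one minus the regularized lower incomplete Gamma function, computed with a plain power series that is adequate for moderate arguments but not for very small tail probabilities. The function names are hypothetical. With the rounded values quoted in Sect.&nbsp;A.3.3 (a statistic of 5.8 for 9 degrees of freedom), this convention should give a probability close to the quoted 0.758.
</p>
<pre>
```python
import math


def null_log_likelihood(n0, n1):
    """Maximum of Eq. (6): n0*log(n0/n) + n1*log(n1/n)."""
    n = n0 + n1
    return n0 * math.log(n0 / n) + n1 * math.log(n1 / n)


def likelihood_ratio(log_l1, log_l0):
    """Eq. (7): c_LL = 2 (log L1 - log L0)."""
    return 2.0 * (log_l1 - log_l0)


def gamma_p(a, x):
    """Regularized lower incomplete Gamma function, power-series form."""
    if not x > 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    k = a
    for _ in range(10000):
        k += 1.0
        term *= x / k
        total += term
        if abs(total) * 1e-16 > abs(term):
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def chi2_upper_tail(c_ll, m):
    """Probability that a chi-square variable with m dof exceeds c_ll."""
    return 1.0 - gamma_p(m / 2.0, c_ll / 2.0)


log_l0 = null_log_likelihood(50000, 31)
print(log_l0, chi2_upper_tail(5.8, 9))
```
</pre>
<p>
A small statistic relative to <I>m</I> gives a probability near one, i.e. the explanatory variables add almost nothing to the constant-only model.
</p>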

<p>

<h4 class="sec3"><a name="SECTION000103200000000000000"></a><A NAME="sec:number_required"></A>
A.3.2. Determination of the number of model planets required
</h4>

<p>
A problem that arose in the course of the present work was to evaluate
the number of model planets that were needed for the logit evaluation.
It is often estimated that about 10&nbsp;times more model points than
observations are sufficient for a good test. We found that
this relatively small number of points indeed leads to a valid
identification of the explanatory variables that are problematic,
i.e. those for which the <IMG
 WIDTH="10" HEIGHT="32" ALIGN="MIDDLE" BORDER="0"
 SRC="img130.png"
 ALT="$\hat{b}$">
coefficient is significantly 
different from 0 (if any). However, the evaluation of the global
<IMG
 WIDTH="17" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
 SRC="img5.png"
 ALT="$\chi ^2$">&nbsp;probability was then found to show considerable statistical
variability, probably because of the relatively large number of explanatory
variables used for the study. 

<p>
In order to test how the probability 
<!-- MATH: ${\cal{P}}_{\chi^2}$ -->
<IMG
 WIDTH="23" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img6.png"
 ALT="${\cal {P}}_{\chi ^2}$">
depends on the size
<I>n</I> of the sample to be analyzed, we first generated a very large list
of <I>N</I><SUB>0</SUB> simulated planets with CoRoTlux. We then drew, by
Monte-Carlo sampling, a smaller subset of 
<!-- MATH: $n_0\le N_0$ -->
<IMG
 WIDTH="48" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img131.png"
 ALT="$n_0\le N_0$">
simulated planets that
was augmented by the <I>n</I><SUB>1</SUB>=31 observed planets and computed 
<!-- MATH: ${\cal{P}}_{\chi^2}$ -->
<IMG
 WIDTH="23" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img6.png"
 ALT="${\cal {P}}_{\chi ^2}$"> using the logit procedure. This exercise was performed 1000&nbsp;times, and the results are 
shown in Fig.&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#fig:test_logit">13</a>. The resulting 
<!-- MATH: ${\cal{P}}_{\chi^2}$ -->
<IMG
 WIDTH="23" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img6.png"
 ALT="${\cal {P}}_{\chi ^2}$">
is
found to be very variable for a sample smaller than <IMG
 WIDTH="13" HEIGHT="14" ALIGN="BOTTOM" BORDER="0"
 SRC="img13.png"
 ALT="$\sim$">20&nbsp;000&nbsp;planets. As a consequence, we chose to present tests performed for

<!-- MATH: $n_0=50~000$ -->
<I>n</I><SUB>0</SUB>=50&nbsp;000 model planets. 
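<p>
The resampling exercise described above can be sketched as follows. To keep the example self-contained, the logit probability is replaced by a stand-in statistic (a difference of means), so the sketch only illustrates the structure of the procedure, namely repeated random draws of <I>n</I><SUB>0</SUB> model planets combined with a fixed observed sample, and not the actual values shown in Fig.&nbsp;13. All names, sample sizes and data are hypothetical.
</p>
<pre>
```python
import random
import statistics


def spread_versus_sample_size(pool, observed, sizes, statistic, repeats=200, seed=0):
    """Standard deviation of `statistic` over repeated random subsets, per size n0."""
    rng = random.Random(seed)
    spreads = {}
    for n0 in sizes:
        values = []
        for _ in range(repeats):
            subset = rng.sample(pool, n0)  # n0 model planets drawn without replacement
            values.append(statistic(subset, observed))
        spreads[n0] = statistics.pstdev(values)
    return spreads


def mean_difference(simulated, observed):
    """Stand-in statistic (NOT the logit chi-square probability)."""
    return sum(simulated) / len(simulated) - sum(observed) / len(observed)


data_rng = random.Random(1)
pool = [data_rng.random() for _ in range(5000)]    # hypothetical large simulated list (N0)
observed = [data_rng.random() for _ in range(31)]  # hypothetical observed sample (n1 = 31)
spreads = spread_versus_sample_size(pool, observed, [10, 1000], mean_difference)
print(spreads)
```
</pre>
<p>
Plotting the spread against <I>n</I><SUB>0</SUB> for the statistic of interest is one way to choose a sample size beyond which the result no longer depends appreciably on the particular draw.
</p>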

<p>
<div class="inset-old">
<table>
<tr><td><!-- init Label --><A NAME="fig:test_logit">&#160;</A><!-- end Label--><A NAME="2854"></A><A NAME="figure2261"
 HREF="img132.png"><IMG
 WIDTH="72" HEIGHT="50" SRC="Timg132.png"
 ALT="\begin{figure}
\par\includegraphics[width=6.5cm,clip]{10097f14.eps}
\end{figure}"></A><!-- HTML Figure number: 13 --></td>
<td class="img-txt"><span class="bold">Figure 13:</span><p>
Values of the <IMG
 WIDTH="17" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
 SRC="img5.png"
 ALT="$\chi ^2$">
probability, 
<!-- MATH: ${\cal{P}}_{\chi^2}$ -->
<IMG
 WIDTH="23" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img6.png"
 ALT="${\cal {P}}_{\chi ^2}$">
(see
  text) obtained after a logit analysis as a function of the size of
  the sample of model planets <I>n</I><SUB>0</SUB>.</p></td>
</tr>

</table></div>
<p>

<h4 class="sec3"><a name="SECTION000103300000000000000"></a>
A.3.3. Analysis of two CoRoTlux samples
</h4>

<p>
Table&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#table:logitres">4</a> (see Sect.&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#sec:stat">2.4</a>) reports the
parameter estimates for each of the planet/star characteristics. We
start by assessing the general quality of the logistic regression by
performing the chi-square test. If the vector of planet
characteristics brings little or no information as to which type of
planet a given observation belongs to, we would expect the logistic
regression to perform badly. In technical terms, we would expect the
conditional probability 
<!-- MATH: $\Pr(Y = 1|{\vec{X}})$ -->
<IMG
 WIDTH="71" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img133.png"
 ALT="$\Pr(Y = 1\vert{\vec{X}})$">
to be equal to the
unconditional probability 
<!-- MATH: $\Pr(Y = 1)$ -->
<IMG
 WIDTH="57" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img134.png"
 ALT="$\Pr(Y = 1)$">.
The <IMG
 WIDTH="17" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
 SRC="img5.png"
 ALT="$\chi ^2$">
test described
above is used to evaluate the significance of the
model. 

<p>
We performed several tests: the first column of results in
Table&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#table:logitres_10_columns">10</a> shows the result of a logit
analysis with the whole series of 9 explanatory variables. Globally,
the model behaves well, with a likelihood ratio statistic 
<!-- MATH: $c_{\rm
LL}=5.8$ -->
<IMG
 WIDTH="57" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img135.png"
 ALT="$c_{\rm
LL}=5.8$">
and a <IMG
 WIDTH="17" HEIGHT="30" ALIGN="MIDDLE" BORDER="0"
 SRC="img5.png"
 ALT="$\chi ^2$">
distribution for 9 degrees of freedom
yielding a probability 
<!-- MATH: ${\cal{P}}_{\chi^2}=0.758$ -->
<IMG
 WIDTH="71" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img136.png"
 ALT="${\cal{P}}_{\chi^2}=0.758$">.
When examining
individual variables, we find that the lowest probability derived from
the Student test is that of [Fe/H]: 
<!-- MATH: ${\cal{P}}_{\rm {\rm [Fe/H]}}=0.164$ -->
<IMG
 WIDTH="90" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img137.png"
 ALT="${\cal{P}}_{\rm {\rm [Fe/H]}}=0.164$">,
implying that the stellar metallicity is not well reproduced. As
discussed previously, this is due to the fact that several planets of
the observed list have no or very poorly constrained determinations of
the stellar [Fe/H], and so a default value of 0 was then used. 

<p>
The other columns in Table&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#table:logitres_10_columns">10</a> show the
result of the logit analysis when removing one variable (i.e. with
only 8 explanatory variables). In agreement with the above analysis,
the highest global probability 
<!-- MATH: ${\cal{P}}_{\chi^2}$ -->
<IMG
 WIDTH="23" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img6.png"
 ALT="${\cal {P}}_{\chi ^2}$">
is obtained for the
model without the [Fe/H] variable. When removing other variables, the
results are very homogeneous, indicating that although the model can
certainly be improved, there is no readily identified problem except
that for [Fe/H]. We hope that future observations will allow for
better constraints on these stars' metallicities.

<p>
In order to further test the method, we show in
Table&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#table:logitres_altered">11</a> the results of an analysis in which
the model radii were artificially increased by 10%. The
corresponding probabilities are significantly lower: we
find that the probability that the model explains the observations is
less than 1/10&nbsp;000. The probabilities for each variable are affected
as well, so that it is impossible to identify the culprit for the bad
fit with the 9 variables. However, when removing <IMG
 WIDTH="18" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img7.png"
 ALT="$R_{\rm p}$">
from the
analysis sample, the fit becomes significantly better. Note that
the results for that column are slightly different from those for the
same column in Table&nbsp;<a href="/articles/aa/full_html/2009/35/aa10097-08/aa10097-08.html#table:logitres_10_columns">10</a> because of the
dependence of <IMG
 WIDTH="10" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img3.png"
 ALT="$\theta $">
on <IMG
 WIDTH="18" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img7.png"
 ALT="$R_{\rm p}$">.

<p>
<A NAME="table:logitres_10_columns"></A><p class="inset-old"><a href="/articles/aa/full_html/2009/35/aa10097-08/table10.html"><span class="bold">Table 10:</span></a>&#160;&#160;
Results of the logit analysis for the fiducial model with
  50&nbsp;000 model planets and 31 observations.</p>																							

<p>
<A NAME="table:logitres_altered"></A><p class="inset-old"><a href="/articles/aa/full_html/2009/35/aa10097-08/table11.html"><span class="bold">Table 11:</span></a>&#160;&#160;
Results of the logit analysis for the altered model (<IMG
 WIDTH="18" HEIGHT="26" ALIGN="MIDDLE" BORDER="0"
 SRC="img7.png"
 ALT="$R_{\rm p}$">
increased by 10%) with 50&nbsp;000 model planets and 31
  observations.</p>		

<p>
<br>

</div></body></html>