%%
%% This is file `template-6s.tex',
%% generated with the docstrip utility.
%%
%% The original source files were:
%%
%% template.raw (with options: `6s')
%%
%% Template for the LaTeX class aipproc.
%%
%% (C) 1998,2000,2001 American Institute of Physics and Frank Mittelbach
%% All rights reserved
%%
%%
%% $Id: template.raw,v 1.8 2002/06/02 15:34:15 frank Exp $
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Please remove the next line of code if you
%% are satisfied that your installation is
%% complete and working.
%%
%% It is only there to help you in detecting
%% potential problems.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%\input{aipcheck}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% SELECT THE LAYOUT
%%
%% The class supports further options.
%% See aipguide.pdf for details.
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[
,final % use final for the camera ready runs
%% ,draft % use draft while you are working on the paper
%% ,numberedheadings % uncomment this option for numbered sections
%% , % add further options here if necessary
]
{aipproc}
\layoutstyle{6x9}
\newcommand{\IntOver}[1]{
\raisebox{-1.2ex}{$ \begin{array}{l}
\displaystyle{\int} \\
\scriptscriptstyle{#1}
\end{array} $ }
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% FRONTMATTER
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{document}
\title{The Volume of Bitnets}
\author{Carlos C. Rodr\'{\i}guez}{
address={The University at Albany, SUNY \\
Department of Mathematics and Statistics \\
\url{http://omega.albany.edu:8008/bitnets}}
}
\begin{abstract}
A bitnet is a dag of binary nodes representing a manifold of probability
distributions for a sequence of binary variables. Bitnets are
riemannian manifolds of finite volume when Fisher information is used as
the metric.
I compute the exact volumes for several interesting topologies including
the complete graph, the directed line of $n$ nodes,
the exploding star (naive bayes) and its reverse, the collapsing star.
I show a fast algorithm for approximating the volumes of general bitnets.
Computer experiments show that the average Ricci scalar of bitnets is
always half an integer. Typically bitnets have singular boundaries
obtained when some of the conditional probabilities for the binary
variables are either zero or one.
At the singular points the Ricci scalar becomes negative infinity.
\end{abstract}
\maketitle
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% MAINMATTER
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
A bitnet, is a regular statistical model for a sequence of binary variables.
The joint probabilities of the sequence are efficiently
described by a directed acyclic graph (dag) whose vertices are the variables
and whose directed edges indicate stochastic influence. Figure
(\ref{fig:bitnets}) shows four examples of bitnets of 3 variables.
The assumption of a network of stochastic influences for the
variables (i.e. a bitnet) allows the following useful factorization of
joint probabilities,
\begin{equation}
p(x_{V}) = \prod_{i\in V} p(x_{i} | x_{pa(i)}). \label{eq:I1}
\end{equation}
Where $V$ is the set of vertices of the bitnet, $pa(i)\subset V$ are the
indices of the parents of $x_{i}$ and if $A \subset V$ we
denote by $x_{A}$ the set of bits with indeces in $A$.
It follows from (\ref{eq:I1}) that all the joint probabilities are obtained
from the values of $p(x_{i}|x_{pa(i)})$. There are $2^{|pa(i)|}$ possible
values for $x_{pa(i)}$ and therefore we need that many independent parameters
in $(0,1)$ for the $i$th node. Thus, the total number of independent
parameters in $(0,1)$ necessary to specify all the joint probabilities of
a bitnet is,
\begin{equation}\label{eq:I2}
d = \sum_{i\in V} 2^{|pa(i)|}.
\end{equation}
For example, the dimensions of the four bitnets in figure (\ref{fig:bitnets})
are (from left to right): $7, 5, 5$, and $6$.
This paper studies the geometry and topology of the hypothesis space of
all the probability distributions of a sequence of binary variables that
satisfy the network of stochastic dependencies encoded by a bitnet.
By (\ref{eq:I1}), there is a one to one map between
the objects in this set and the points in the $d$ dimensional
open unit cube. We call this kind of hypothesis space a bitnet model.
\begin{figure}[htb!]
\ifx\pdftexversion\undefined
\includegraphics[scale=0.35]{dags}
\else
\includegraphics[angle=-90,scale=0.4]{dags}
\fi
\caption{Four examples of bitnets. From left to right are:
a) $\vec{K}_{3}$ the complete dag of 3. b) $\vec{L}_{3}$ the directed
line of 3. c) $\vec{E}_{3}$ the exploding star of 3. d) $\vec{C}_{3}$,
the collapsing star of 3.}
\label{fig:bitnets}
\end{figure}
\section{The Metric of a Bitnet Model}
When $|V|=n$ the information metric, (i.e., Fisher information matrix)
has $n^{2}$ components,
\begin{equation}\label{eq:II1}
g_{ij}(\theta) = E(l_{i}(X)l_{j}(X) | \theta)
\end{equation}
with,
\begin{equation}\label{eq:II2}
l_{k}(x) = \frac{\partial{\log p_{\theta}(x)}}
{\partial{\theta_{k}}}
\end{equation}
where $\theta = (\theta_{1},\ldots,\theta_{d})$ is the vector of parameters
in the $d$-cube. In order to obtain a simple expression for the metric
and the volume of a bitnet, we need to introduce some notation.
If $pa(i) = \{j_{1},j_{2},\ldots,j_{k}\}$ with $j_{1}0$ and $\sum t_{i}^{2} = 1$ by simply
using the standard $t_{i} = \sqrt{p_{i}}$ transformation. If follows
immediately from this considerations that the information volume of the
complete bitnet is the same as that of the multinomial which turns out
(by using the standard $\sqrt{p_{i}}$ transformation)
to be exactly half the volume of the unit $d$-sphere $S^{d}$, i.e.,
\begin{equation}\label{eq:III13}
\mbox{vol}(S^{d}) = \frac{2 \pi^{\frac{d+1}{2}}}{\Gamma(\frac{d+1}{2})}.
\end{equation}
Thus,
\begin{equation}\label{eq:III14}
\mbox{vol}(\vec{K}_{n}) = \frac{\pi^{2^{n-1}}}{(2^{n-1}-1)!}
\end{equation}
This sequence becomes exponentially small very quickly,
\begin{eqnarray}\label{eq:III15}
\mbox{vol}(\vec{K}_{n}) &=&
\pi,\pi^{2},\frac{\pi^{4}}{6},\frac{\pi^{8}}{5040},\ldots
\sim \frac{1}{\sqrt{2\pi}}\left(\frac{\pi}{k}\right)^{k} e^{-k}\\
&=& 3.14, 9.86, 16.2, 1.87, 0.0000683, \ldots \nonumber
\end{eqnarray}
where the asymptotic expression is for $k=2^{(n-1)}$ as $n\rightarrow\infty$.
The Ricci scalar is, not surprisingly, constant and it can be obtained
from the corresponding value for the multinomial model:
\begin{equation}\label{eq:III16}
\mbox{Ricci}(\vec{K}_{n}) = \frac{d(d-1)}{4} = \frac{(2^{n}-1)(2^{n-1}-1)}{2}.
\end{equation}
\subsection{The Directed Line: $\vec{L}_{n}$}
The $n$ nodes are connected linearly, i.e., one is the parent of two who
is the parent of three,\ldots, who is the parent of $n$. The dimension
is $d= 1+(n-1)2= 2n-1$. The exact volume becomes very difficult to compute
for values of $n\geq 4$ but computer experiments carried with
the aid of {\em vTool}\footnote{vTool is a maple module available at
\url{http://omega.albany.edu:8008/bitnets/}.} show that,
\begin{equation}\label{eq:volL}
\mbox{vol}(\vec{L}_{n}) = 4 \left(\frac{\pi}{2}\right)^{3n-4}.
\end{equation}
The first two cases $n\in\{1,2\}$ are trivial (they are also complete bitnets)
so the values of $\pi$ and $\pi^{2}$ are inmediatly obtained. The case
$n=3$ is already not easy (maple can't do it alone) but a special change of
variables in 3D shows that this volume must be the same as the volume
of $\vec{E}_{2+1}$ which is $\frac{\pi^{5}}{8}$
(see below equation (\ref{eq:III17})). Many other values
of $n$ have been estimated by Monte Carlo and they provide strong evidence
to the validity of (\ref{eq:volL}).
The Ricci scalar for the case $n=3$ is computed with {\em vTool} as,
\begin{equation}\label{eq:L2}
R = 5 - \frac{1}{2\rho}
\end{equation}
where $\rho$ is the variance of the central node. I believe that to
hold in general, i.e., that the scalar curvature always depends on the
variance of the central nodes. By looking at the components of the
Riemann tensor it is possible to show that the scalar curvature is always
independent of the parameters of the leave nodes.
\subsection{The Exploding Star: $\vec{E}_{n+1}$}
One parent node with $n$ children. This is what the machine learning
community calls Na{\"{i}}ve Bayes. The children bits
(e.g., the observed symptoms) are assumed to be independent conditionally
on the parent (the presence or absence of the disease). This bitnet is
by far the most popular bayesian network due to its simplicity and to
its surprising accuracy and robustness. As I argue below, their volumes
can be used to explain, at least in part, their success in applications.
The volume of $\vec{E}_{n+1}$ is easily obtained from the general formula
(\ref{eq:II10.2}),
\begin{equation}\label{eq:III17}
\mbox{vol}(\vec{E}_{n+1}) = \int \left[ \frac{1}{\rho_{1}}
\frac{\rho_{1}}{\rho_{2}\rho_{3}}\frac{\rho_{1}}{\rho_{4}\rho_{5}}
\cdots \frac{\rho_{1}}{\rho_{2n}\rho_{2n+1}} \right]^{1/2}\ \ d\theta
\end{equation}
separating the variables by Fubini, using (\ref{eq:II10.5}) and defining,
\begin{equation}\label{eq:B}
B(r) = \int_{0}^{1} t^{r/2} (1-t)^{r/2}\ dt = \int \rho_{i}^{r/2} d\theta
= \frac{\Gamma(\frac{r}{2}+1)^{2}}{\Gamma(r+1)}
\end{equation}
we obtain,
\begin{equation}\label{eq:III18}
\mbox{vol}(\vec{E}_{n+1}) = \pi^{2n}\ B(n-1).
\end{equation}
The sequence of volumes explodes exponentially,
\begin{eqnarray}\label{eq:III19}
\mbox{vol}(\vec{E}_{n+1}) &=& \pi^{2}, \frac{\pi^{5}}{8}, \frac{\pi^{6}}{6},
\cdots \sim \sqrt{2\pi n} \left(\frac{\pi^{2}}{2}\right)^{n} \\
&=& 9.87, 38.25, 160.24, 698.69, 3121.6,\ldots \nonumber
\end{eqnarray}
The computation of the scalar curvature was obtained with the help of
{\em vTool}. It depends only on the variance
$\rho=\rho_{1}$ associated to the parent (center) node,
\begin{equation}\label{eq:III20}
\mbox{Ricci}(\vec{E}_{n+1}) = R(\rho) = \frac{n}{2}\left[
(2n+1) - \frac{n-1}{2\rho} \right] \leq \frac{3}{2}n
\end{equation}
As $\rho \rightarrow 0$ the scalar curvature diverges:
$R(\rho)\rightarrow -\infty$.
The boundary $\partial\vec{E}_{n+1}$, consisting of all the models with
coordinates $\theta\in [0,1]^{d}$ with at least one component
$\theta_{j}=0$, contains the surface $\rho=0$ of singularities.
This has an easy explanation. The estimation of the parameters of the
model obviously becomes more difficult when the variance of the parent
node is close to zero. If there is only one in a billion chance of
observing a given disease, we need billions of observations to estimate
the probabilities with a Na{\"{i}}ve Bayes model. In the limit of zero variance
we are at an inferential black hole. No information (about {\em any} of
the parameters of the model) can {\em ever} be extracted from data.
Notice that the scalar curvature does not depend on the parameters of
leave nodes. That is always the case for all bitnets.
The average scalar curvature,$$, is computed by integrating $R$ given by
(\ref{eq:III20}), with respect to the volume element of $\vec{E}_{n+1}$
and dividing by the total volume (\ref{eq:III18}). We obtain:
\begin{equation}\label{eq:III21}
\left< R \right> = \frac{n}{2}.
\end{equation}
\subsection{The Collapsing Star: $\vec{C}_{n+1}$}
This bitnet corresponds to the one child of a promiscuous mother!
i.e., $n$ parent nodes with only one child. It has dimension $d=2^{n}+n$
and its volume is computed straight forwardly from (\ref{eq:II10.2}),
\begin{equation}\label{eq:III22}
\mbox{vol}(\vec{C}_{n+1}) = \int \left\{ \frac{1}{\rho_{1}}\frac{1}{\rho_{2}}
\cdots \frac{1}{\rho_{n}} \frac{[\rho_{1}\rho_{2}\cdots \rho_{n}]^{2^{n-1}}}
{\rho_{n+1}\cdots \rho_{n+2^{n}}} \right\}^{1/2}\ d\theta
\end{equation}
integrating out the parameters of the $2^{n}$ leave nodes, using Fubini and
the function $B$ defined in (\ref{eq:B}) we can write,
\begin{equation} \label{eq:III23}
\mbox{vol}(\vec{C}_{n+1}) = \pi^{2^{n}} B(2^{n-1}-1)^{n}
\end{equation}
With the aid of {\em vTool} we can show that the Ricci scalar has the form,
\begin{equation}\label{eq:III24}
R = a - b\left(\frac{1}{\rho_{1}}+\frac{1}{\rho_{2}}+\cdots +
\frac{1}{\rho_{n}} \right)
\end{equation}
where $(a,b)$ depend only on $n$. The only known values are for
$n=1,2,3,4$. They are $(a,b)= (3/2,0), (10,1/2), (54,3), (272,14)$ with
average Ricci scalars $= 3/2, 2, 6, ?$. Equation (\ref{eq:III24})
tells us that, as geometric objects, a child node with only one parent
is radically different than a child node with two or more parents. When
$n=1$ the space has constant scalar curvature but when $n\geq 2$ the
curvature diverges to minus infinity as we let the variance of any of the
parent nodes to go to zero. So what's so different about the cases
$n=1$ and $n=2$? What does curvature really mean statistically?
I think what makes the cases $n=1$ and $n=2$ different is that with
only two nodes we can reverse the arrow, obtaining a bitnet which is
markov equivalent to the original one (same V structures) but when
$n=2$ reversing arrows produces a non markov equivalent bitnet. Thus,
with only one parent and one child, if the parent has a variance very
close to zero the apparent singularity disappears by the reparametrization
implied by the reversing of the arrow making the parent a leave node. Being
a true geometric invariant, the information contained in the Ricci scalar
holds in all possible descriptions (reparametrizations, markov equivalence
transformations, etc..) of the model and it must be telling us something
significant about the difficulty of estimation at each point.
\section{Bounds for General Volumes}
For most large bitnets the exact computation of their volumes becomes
impractical. This section shows cook-book formulas for a lower and an
upper bound for the volumes of general bitnets. These bounds can be
shown to be exact for complete bitnets and for bitnets with maximum
depth of $1$ (i.e., bitnets without grand parents). The geometric mean
between the lower and the upper bound has been observed to perform
remarkably well in large simulation studies \cite{Lauria2003}.
I claim that, the exact volume $Z$ of a general bitnet is always bounded
between $L$ and $U$, i.e., $L\leq Z\leq U$ where, the upper and lower
bounds are given in terms of the function $B$ defined in (\ref{eq:B}),
by
\begin{equation}\label{eq:Lbound}
L = \prod_{i=1}^{n} B^{m(i)}(a_{i})
\end{equation}
with
\begin{equation}\label{eq:ai}
a_{i} = -1 + \sum_{j\in ch(i)} 2^{|pa(j)|-|pa(i)|-1}
\end{equation}
and
\begin{equation}\label{eq:Ubound}
U = \prod_{i=1}^{n} B^{m(i)}(b_{i})
\end{equation}
with $b_{i}$ being the same as (\ref{eq:ai}) except that now the sum does not
run over all the children of $i$, but only over those children $j$ of $i$
for which $pa(j)\subset pa(i)$.
\section{The Importance of Volumes}
Why should we care about the volumes of bitnets?
One answer is model selection.
On the one hand we want our models to be sufficiently large
to be able to approximate the true distribution closely. On the other
hand we only have limited resources and we need small models that can
be handled efficiently. There are many ways to measure the size
of a model but, not surprisingly, the information volume is explicitly
showing up in the formulas. Consider, for example, the latest expression
for the minimum description length (MDL) criterion for model selection
\cite{Rissanen96,Balasubramanian96},
\begin{equation}\label{eq:IV1}
\mbox{MDL} = - \sum_{i=1}^{N}\log p(y_{i}|\hat{\theta}) +
\frac{d}{2} \log\frac{N}{2\pi} + \log V
\end{equation}
where $N$ is the sample size, $(y_{1},\ldots,y_{N})$ is the observed data,
$\hat{\theta}$ is the MLE of the vector of parameters, $d$ is the dimension
of the model (number of free parameters) and $V$ is the information volume
of the model. The MDL approximates the length of the best possible code for
the observed data that can be built with the help of the model. According to
the MDL criterion, the best model is the one that allows the shortest possible
encoding of the observations \cite{Rissanen96}.
It so happens that (\ref{eq:IV1}) is also the $o(1)$ approximation
to $-\log P(M|y)$ (see \cite{Balasubramanian96,Rodriguez2003}) i.e., minus
the logarithm of the posterior probability of the model $M$. Thus, on the
basis of the observed data alone, with no other prior information at hand,
given the choice between two models of the same dimensionality providing
the same fit to the data we must prefer the one with smaller volume $V$.
One must notice however, that the volume only appears as the third term
(of order $1=N^{0}$) of the asymptotic expansion (as $N\rightarrow\infty$).
The first term ($-\log\ $ likelihood fit) scales linearly with $N$, the
second term scales linearly in the number of parameters $d$ but
logarithmically on the sample size $N$. Thus, for sufficiently large $N$,
the volume term will eventually become negligible but exactly {\it when}
it does become negligable depends on the specific models being considered.
Simulation studies \cite{Lauria2003} show improvements in model selection
of the order of $28\%$ on the average, when the expression (\ref{eq:IV1})
is used instead of the traditional MDL without the volume term.
The modern expression for the MDL (\ref{eq:IV1}) exemplifies the natural
desirability of models with small volume. Desirability for large volumes
is naturally found on measures of generalization power. As just one example
consider the expression for the generalization power of heat kernels
\cite{LaffertyLebanon2004},
\begin{equation}\label{eq:IV2}
\log {\cal N}(\epsilon,{\cal F}_{R}({\bf x})) = O\left(
\left(\frac{V}{t^{\frac{d}{2}}}\right)
\log^{\frac{d+2}{2}}\left(\frac{1}{\epsilon}\right) \right)
\end{equation}
where ${\cal N}(\epsilon,{\cal F}_{R}({\bf x}))$ denotes the size of the
smallest $\epsilon$-cover of the space ${\cal F}_{R}({\bf x})$ which is
the ball of radius $R$ (in the $\sup$-norm) of functions defined on the
data ${\bf{x}}$ in terms of a heat kernel $K_{t}$. The specific technicalities
of this method of estimation are not very important for us here. The main
point is that for a given accuracy of estimation ($\epsilon$),
between two competing models
of the same dimension $d$, we must choose the one with the largest information
volume $V$ in order to increase generalization power.
\begin{figure}[htb!]
\includegraphics[scale=0.40]{ELNpic}
\caption{Complexity of $\vec{E}_{n}$ and $\vec{L}_{n}$ for different sample
sizes $N=100,500,1000$. The magnitude of curves increases with $N$ and
Explode > Line}
\label{fig:ELNpic}
\end{figure}
\begin{figure}[htb!]
\includegraphics[scale=0.40]{CNpic}
\caption{Complexity of $\vec{C}_{n}$ for different sample sizes $N=100,500,1000$.
The magnitude of curves increases with $N$}
\label{fig:CNpic}
\end{figure}
\section{Complexity}
Let us separate the expression (\ref{eq:IV1}) as,
\begin{equation}\label{eq:V1}
\mbox{MDL} = \mbox{Fit} + \mbox{Complexity}
\end{equation}
where by ``Complexity'' we simply mean the sum of the last two terms
in (\ref{eq:IV1}). These are the terms that do not involve the
observed data, only their number $N$, the dimension of the model $d$,
and its volume $V$. Figure (\ref{fig:ELNpic}) shows the complexity
terms for the $\vec{E}_{n}$ and $\vec{L}_{n}$ bitnets for binary
sequences of different lengths $n<12$ and for three sample sizes
$N=100, 500, 1000$. The picture is clear. The complexities (actually
just their volumes for they have the same $d$) of the exploding star
and the line are very similar straight lines with the exploding star
always above the line. Complexity increases with both, the size of
the network $n$ and the sample size $N$. This means that by adding
more leaves (symptoms) to a Na\"{\i}ve Bayes network we increase its
complexity and by increasing its volume we (probably) increase its
power of generalization.
On the other hand figure (\ref{fig:CNpic}) shows a very different behavior for
the $\vec{C}_{n}$ bitnet. The complexity reaches a maximum saturation point and
then decreases without bound. Thus, adding more parents may help increase
generalization power but after crossing the saturation size the bitnet will
loose complexity and probably its power to generalize correctly as well.
I do believe that there is more to the story of model complexity than what's
available from (\ref{eq:IV1}). Recall that (\ref{eq:IV1}) is only the
first three terms of an asymptotic expansion. The neglected terms contain the
model curvatures (see \cite{Rodriguez2003}) and by neglecting them we are
failing to account for the difficulty of extracting information out of the
sample due to the presence of high model curvatures. One may define model complexity
in pure geometric terms as $$ and then try to characterize the models of extreme
complexity with data and in the vacuum, without data. This type of variational
problem has been very successful at describing observational data in physics
and a lot is already known about it.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% BACKMATTER
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\begin{theacknowledgments}
%\end{theacknowledgments}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% You may have to change the BibTeX style below, depending on your
%% setup or preferences.
%%
%% If the bibliography is produced without BibTeX comment out the
%% following lines and see the aipguide.pdf for further information.
%%
%% For The AIP proceedings layouts use either
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliographystyle{aipproc} % if natbib is available
%\bibliographystyle{aipprocl} % if natbib is missing
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% You probably want to use your own bibtex database here
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\bibliography{carlos}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Just a reminder that you may have to run bibtex
%% All of it up to \end{document} can be removed
%% if you don't like the warning.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\IfFileExists{\jobname.bbl}{}
{\typeout{}
\typeout{******************************************}
\typeout{** Please run "bibtex \jobname" to optain}
\typeout{** the bibliography and then re-run LaTeX}
\typeout{** twice to fix the references!}
\typeout{******************************************}
\typeout{}
}
\end{document}
\endinput
%%
%% End of file `template-6s.tex'.