HBC: Hierarchical Bayes Compiler
Pre-release version 0.5 (15 Nov 2007); see Updates below for what has changed.
Older versions:
0.4
0.3
0.2
0.1
HBC is a toolkit for implementing hierarchical Bayesian models. HBC
was created because I felt I was spending too much time writing
boilerplate code for inference problems in Bayesian models. HBC has
several goals:
- Allow a natural implementation of hierarchical models.
- Enable quick and dirty debugging of models for standard data types.
- Focus on large-dimension discrete models.
- More general than simple Gibbs sampling (e.g., allowing for
maximizations, EM and message passing).
- Allow for hierarchical models to be easily embedded in larger
programs.
- Automatic Rao-Blackwellization (aka collapsing).
- Allow efficient execution via compilation to other languages (such
as C, Java, Matlab, etc.).
These goals distinguish HBC from other Bayesian modeling software,
such as Bugs (or WinBugs). In
particular, our primary goal is that models created in HBC can be used
directly, rather than only as a first-pass test. Moreover, we aim for
scalability with respect to data size. Finally, since the goal of HBC
is to compile hierarchical models into standard programming
languages (like C), these models can easily be used as part of a
larger system. This last point is in the spirit of the dynamic
programming language Dyna.
Note that some of these goals aren't yet supported (in particular,
parts of the fourth goal and full support for the sixth, counting the
list above in order), but they should be coming soon!
A Quick Example
To give a flavor of what HBC is all about, here is a complete
implementation of a Bayesian mixture of Gaussians model in HBC format:
alpha ~ Gam(10,10)
mu_{k} ~ NorMV(vec(0.0,1,dim), 1) , k \in [1,K]
si2 ~ IG(10,10)
pi ~ DirSym(alpha, K)
z_{n} ~ Mult(pi) , n \in [1,N]
x_{n} ~ NorMV(mu_{z_{n}}, si2) , n \in [1,N]
If you are used to reading hierarchical models, it should be quite
clear what this model does. Moreover, by keeping to a very LaTeX-like
style, it is quite straightforward to automatically typeset any
hierarchical model. If this file were stored in
mix_gauss.hier, and if we had data for x stored in a
file called X, we could run this model (with two Gaussians)
directly by saying:
hbc simulate --loadM X x N dim --define K 2 mix_gauss.hier
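For readers less used to this notation, the generative process that the six model lines describe can be sketched in plain Python. This is only an illustration of the model, not of how HBC performs inference, and it rests on several assumptions: the function name sample_mixture is hypothetical, Gam(10,10) is read as shape 10 and rate 10, IG(10,10) as shape 10 and scale 10, and NorMV(m, v) as a spherical normal with mean vector m and variance v. Indices run from 0 here, whereas the HBC model uses [1,K] and [1,N].

```python
import random

def sample_mixture(N, K, dim, seed=0):
    """Draw one synthetic data set from the mixture-of-Gaussians model (sketch)."""
    rng = random.Random(seed)
    # alpha ~ Gam(10,10): assumed shape 10, rate 10 (i.e. scale 1/10)
    alpha = rng.gammavariate(10.0, 1.0 / 10.0)
    # mu_k ~ NorMV(vec(0.0,1,dim), 1): assumed zero mean, unit variance per dimension
    mu = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(K)]
    # si2 ~ IG(10,10): reciprocal of a Gamma(shape 10, rate 10) draw (assumed)
    si2 = 1.0 / rng.gammavariate(10.0, 1.0 / 10.0)
    # pi ~ DirSym(alpha, K): normalize K independent Gamma(alpha, 1) draws
    g = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    total = sum(g)
    pi = [v / total for v in g]
    # z_n ~ Mult(pi): one component label per data point
    z = rng.choices(range(K), weights=pi, k=N)
    # x_n ~ NorMV(mu_{z_n}, si2): spherical normal around the chosen mean
    sd = si2 ** 0.5
    x = [[rng.gauss(mu[z[n]][d], sd) for d in range(dim)] for n in range(N)]
    return pi, mu, si2, z, x
```

Calling sample_mixture(N=50, K=2, dim=3) returns the mixing weights, the component means, the shared variance, the labels and the data, in that order. Inference runs the other way: given x, it recovers the remaining variables.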
Perhaps closer to my heart would be a six-line implementation of the
Latent Dirichlet Allocation model, complete with hyperparameter
estimation:
alpha ~ Gam(0.1,1)
eta ~ Gam(0.1,1)
beta_{k} ~ DirSym(eta, V) , k \in [1,K]
theta_{d} ~ DirSym(alpha, K) , d \in [1,D]
z_{d,n} ~ Mult(theta_{d}) , d \in [1,D] , n \in [1,N_{d}]
w_{d,n} ~ Mult(beta_{z_{d,n}}) , d \in [1,D] , n \in [1,N_{d}]
This code can either be run directly (e.g., by a simulate
call as above) or compiled to native C code for (much) faster
execution.
User's Guide
The user's guide can be downloaded in Adobe Acrobat format.
Mailing List
There is a mailing
list set up locally for HBC-related discussions. You can
subscribe by visiting the previous link. I anticipate the mailing
list will be reasonably low volume. I'll announce updates, and you
can feel free to post questions, comments or bugs.
Distribution
You can download either the source code as a tar bundle or a Linux executable, also as
a tar bundle. You can build the
source using GHC by saying: ghc --make -fglasgow-exts Main -o hbc.
If you're on Debian/Ubuntu, be sure you have the
libghc6-mtl-dev package, or you'll get an error about Control.Monad.State.
Both distribution forms include sample hierarchical models and data
for the following models (plus some more):
The source, executables and examples are all completely free for any
purpose whatsoever.
Questions, Comments and Bugs
Please email me directly with supporting information, or subscribe to the mailing
list.
Updates
New in Version 0.5
- (Minor update) Many bug fixes, especially with respect to
marginalization.
- (Minor update) --loadD can now load 3 and 4
dimensional data sets; see LDA4d.hier (included in the
distribution) for an example.
- (Minor update) A large amount of code optimization is now
enabled; many constant operations are precomputed and double loops no
longer incur double memory (and time) overheads.
New in Version 0.4
- (Minor update) Collection of bug fixes, mostly to do with
Markovization and some other low-level type inference details.
- (Minor update) --define and --load? can
now take command-line arguments for compiled code. See
LDA.hier for an example; e.g., --loadD arg_{1} w V D N ;
means to use the first command-line argument to load w.
Bad things will happen if you try to use this with simulate.
New in Version 0.3
- (Major update) Bug fixes for doubly indexed variables,
i.e., things like a_{b_{c}}. These were hardly working, if at
all. Many more things are possible with these fixes.
- (Minor update) --dump now works with compiled
code (with the exception of best dumping), so you can
actually see what the compiled sampler is doing.
- (Minor update) You can elect to maximize some variables
(instead of sampling them) by saying --maximize VAR.
- (Minor update) Now comes with an implementation of IBM model 1 for machine translation.
New in Version 0.2
- (Major update) Dirichlet/Multinomial pairs can now be
marginalized out! This means that you can, for instance, obtain the
"collapsed Gibbs sampler" for LDA. To marginalize out a
multinomial variable called, say, theta, you need only
specify "--collapse theta" as an argument to
hbc.
- (Minor update) The generated C code now keeps track of the
best sample thus far and displays an asterisk whenever a better sample
is encountered.
- (Minor update) Command-line options can be specified
directly in .hier source files. Simply begin a line with
--# (which would otherwise be treated as a comment) and you
can just write out any command-line option; see LDA.hier for an example.
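To make the Dirichlet/Multinomial marginalization from the first item above concrete, here is a minimal Python sketch of the textbook collapsed Gibbs update for one LDA topic assignment with theta integrated out. It is not the code HBC generates. The function name resample_topic and its arguments are hypothetical, beta is assumed to be held fixed as a K-by-V table of word probabilities, and topics and words are indexed from 0.

```python
import random

def resample_topic(rng, doc_topic_counts, beta, word, old_topic, alpha):
    """Resample one token's topic with the document's theta marginalized out.

    doc_topic_counts[k] counts the tokens of this document assigned to topic k
    (including the current token); beta[k][word] is topic k's probability of word.
    """
    K = len(doc_topic_counts)
    # Remove the token's own assignment so it does not vote for itself.
    doc_topic_counts[old_topic] -= 1
    # With theta ~ DirSym(alpha, K) integrated out, the conditional is
    # p(z = k | rest) proportional to (count_k + alpha) * beta[k][word].
    weights = [(doc_topic_counts[k] + alpha) * beta[k][word] for k in range(K)]
    new_topic = rng.choices(range(K), weights=weights, k=1)[0]
    # Record the new assignment in the counts.
    doc_topic_counts[new_topic] += 1
    return new_topic
```

The counts take the place of an explicit theta: the sampler never stores or resamples the multinomial parameter, which is what "collapsed" refers to here.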