HBC: Hierarchical Bayes Compiler
Pre-release version 0.7 (03 Apr 2008); see what has been updated in the Updates section below.
Older versions: 0.6, 0.5, 0.4, 0.3, 0.2, 0.1
HBC is a toolkit for implementing hierarchical Bayesian models. HBC
was created because I felt like I spent too much time writing
boilerplate code for inference problems in Bayesian models. There are
several goals of HBC:
1. Allow a natural implementation of hierarchical models.
2. Enable quick and dirty debugging of models for standard data types.
3. Focus on large-dimension discrete models.
4. Be more general than simple Gibbs sampling (e.g., allowing for maximization, EM and message passing).
5. Allow hierarchical models to be easily embedded in larger programs.
6. Automatic Rao-Blackwellization (a.k.a. collapsing).
7. Allow efficient execution via compilation to other languages (such as C, Java, Matlab, etc.).
8. Support for non-parametric models.
These goals distinguish HBC from other Bayesian modeling software,
such as Bugs (or WinBugs). In
particular, our primary goal is that models created in HBC can be used
directly, rather than only as a first-pass test. Moreover, we aim for
scalability with respect to data size. Finally, since the goal of HBC
is to compile hierarchical models into standard programming
languages (like C), these models can easily be used as part of a
larger system. This last point is in the spirit of the dynamic
programming language Dyna.
Note that some of these aren't yet supported (in particular, parts of goal 4 and full support for goal 6), but they should be coming soon!
A Quick Example
To give a flavor of what HBC is all about, here is a complete
implementation of a Bayesian mixture of Gaussians model in HBC format:
alpha ~ Gam(10,10)
mu_{k} ~ NorMV(vec(0.0,1,dim), 1, dim) , k \in [1,K]
si2 ~ IG(10,10)
pi ~ DirSym(alpha, K)
z_{n} ~ Mult(pi) , n \in [1,N]
x_{n} ~ NorMV(mu_{z_{n}}, si2, dim) , n \in [1,N]
If you are used to reading hierarchical models, it should be quite
clear what this model does. Moreover, by keeping to a very LaTeX-like
style, it is quite straightforward to automatically typeset any
hierarchical model. If this file were stored in
mix_gauss.hier, and if we had data for x stored in a
file called X, we could run this model (with two Gaussians)
directly by saying:
hbc simulate --loadM X x N dim --define K 2 mix_gauss.hier
Perhaps closer to my heart would be a six-line implementation of the
Latent Dirichlet Allocation model, complete with hyperparameter
estimation:
alpha ~ Gam(0.1,1)
eta ~ Gam(0.1,1)
beta_{k} ~ DirSym(eta, V) , k \in [1,K]
theta_{d} ~ DirSym(alpha, K) , d \in [1,D]
z_{d,n} ~ Mult(theta_{d}) , d \in [1,D] , n \in [1,N_{d}]
w_{d,n} ~ Mult(beta_{z_{d,n}}) , d \in [1,D] , n \in [1,N_{d}]
This code can either be run directly (e.g., by a simulate
call as above) or compiled to native C code for (much) faster
execution.
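For concreteness, here is a sketch of what compiling and running the LDA model might look like; the compile mode mirrors the simulate call above, and the data file name (testW), topic count and compiler flags are illustrative assumptions rather than part of the model:
hbc compile --loadD testW w V D N --define K 10 LDA.hier LDA.c
gcc -O2 LDA.c -lm -o lda
./lda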
User's Guide
Can be downloaded in Adobe Acrobat format.
(Note: this is an old user's guide.)
Mailing List
There is a mailing
list set up locally for HBC-related discussions. You can
subscribe by visiting the previous link. I anticipate the mailing
list will be reasonably low volume. I'll announce updates, and you
can feel free to post questions, comments or bugs.
Distribution
You can download either the source code as a tar bundle or a Linux executable, also as
a tar bundle. You can build the
source using GHC by saying: ghc --make -fglasgow-exts Main -o hbc.
If you're on Debian/Ubuntu, be sure you have the
libghc6-mtl-dev package, or you'll get an error about Control.Monad.State.
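For example, a from-source build on Debian/Ubuntu might look like this (a sketch; the apt-get step and the directory name are assumptions about your setup, and GHC itself must already be installed):
sudo apt-get install libghc6-mtl-dev
cd hbc
ghc --make -fglasgow-exts Main -o hbc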
Both distribution forms include sample hierarchical models and data,
including the mixture of Gaussians and LDA models above (plus some more).
The source, executables and examples are all completely free for any
purpose whatsoever.
Questions, Comments and Bugs
Please email me directly with supporting information, or subscribe to the mailing
list.
Acknowledgments
Many people have been extremely helpful in the development of HBC.
Thanks go out to:
Chia-Hui Chang,
Robbie Haertel,
David Hall,
David Ellis,
Aleks Jakulin,
David Mimno,
Michael Braun.
Updates
New in Version 0.7
- (Major update) Support for Pitman-Yor priors; implements
(roughly) the No-Gaps algorithm when parameters are not collapsed and
(roughly) Neal's algorithm 8 when parameters are collapsed.
New in Version 0.6
- (Major update) Massive speedups in C based on caching of vector sums; non-collapsed LDA is 50 times faster, collapsed is 200 times faster!
- (Minor update) You can now "--dump best" in compiled models.
- (Minor update) Some bugs regarding gamma distributions are fixed.
- (Minor update) Type declaration syntax changed slightly.
New in Version 0.5
- (Minor update) Many bug fixes, especially with respect to
marginalization.
- (Minor update) --loadD can now load 3- and 4-dimensional
data sets; see LDA4d.hier (included in the
distribution) for an example.
- (Minor update) A large amount of code optimization is now
enabled; many constant operations are precomputed and double loops no
longer incur double memory (and time) overheads.
New in Version 0.4
- (Minor update) Collection of bug fixes, mostly to do with
Markovization and some other low-level type inference details.
- (Minor update) --define and --load? can
now take command-line arguments for compiled code. See
LDA.hier for an example, but, e.g., --loadD arg_{1} w V D N ;
means to use the first command-line argument to load w (a sketch appears below).
Bad things will happen if you try to use this with simulate.
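As a rough sketch of how this might be used (the data file names and the exact compile invocation are assumptions for illustration), a model whose .hier file contains a line like --# --loadD arg_{1} w V D N ; could be compiled once and then pointed at different data sets at run time:
hbc compile LDA.hier LDA.c
gcc -O2 LDA.c -lm -o lda
./lda trainW
./lda heldoutW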
New in Version 0.3
- (Major update) Bug fixes for doubly indexed variables,
i.e., things like a_{b_{c}}. These were hardly working, if at
all; many more things are possible with these fixes.
- (Minor update) --dump now works with compiled
code (with the exception of best dumping), so you can
actually see what the compiled sampler is doing.
- (Minor update) You can elect to maximize some variables
(instead of sampling them) by saying --maximize VAR; a sketch appears after this list.
- (Minor update) Now comes with an implementation of IBM model 1 for machine translation.
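For example (a sketch; the data file, dimensions and topic count are illustrative), to maximize the LDA hyperparameter alpha rather than sampling it, one might say:
hbc simulate --maximize alpha --loadD testW w V D N --define K 2 LDA.hier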
New in Version 0.2
- (Major update) Dirichlet/Multinomial pairs can now be
marginalized out! This means that you can, for instance, obtain the
"collapsed Gibbs sampler" for LDA. To marginalize out a
multinomial variable called, say, theta, you need only
specify "--collapse theta" as an argument to
hbc.
- (Minor update) The generated C code now keeps track of the
best sample thus far and displays an asterisk whenever a better sample
is encountered.
- (Minor update) Command-line options can be specified
directly in .hier source files. Simply begin a line with
--# (which would otherwise be treated as a comment) and you
can write out any command-line option; see LDA.hier for an example, or the sketch below.
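As a sketch of what such embedded options might look like at the top of the LDA model from above (the particular values are illustrative; see LDA.hier in the distribution for the authoritative form), combining them with the --collapse option described earlier:
--# --define K 10
--# --collapse theta
--# --collapse beta
alpha ~ Gam(0.1,1)
eta ~ Gam(0.1,1)
beta_{k} ~ DirSym(eta, V) , k \in [1,K]
...
With these lines in place, the model can be simulated or compiled without repeating the options on the command line.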