by [anonymous]

I've been reading about Church, a new programming language developed in a prize-winning MIT doctoral thesis and designed to make computers better at modeling probability distributions.

The idea is that simulation is cheap (given a probability distribution, generate an example outcome) but inference is expensive (given a set of data, what is the most likely probability distribution that could have generated it?). This is essentially a Bayesian task, and it's what we want to do to understand, say, which borrowers are likeliest to default, or where terrorists are likely to strike again. It's also the necessary building block of AI. The problem is that the space of probability distributions that can explain the data is very big. Infinitely big in reality, of course, but still exponentially big after discretizing. Also, while the computational cost of evaluating f(g(x)) is just the cost of f plus the cost of g, composing two conditional probability distributions B|A and C|B requires summing over the intermediate variable:

P(C|A) = Σ_B P(C, B|A) = Σ_B P(C|B) P(B|A),

and the time to evaluate such sums grows exponentially rather than linearly as more variables are chained together.
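To make the cost concrete, here is a minimal Python sketch of that summation (the particular tables and numbers are made up for illustration). One exact composition has to visit every combination of values of A, B, and C, so the work grows with the product of the state-space sizes.

```python
# A minimal sketch; the example tables and numbers are invented for illustration.
# Conditional distributions are stored as nested dicts: table[a][b] = P(B=b | A=a).

def compose(p_b_given_a, p_c_given_b):
    """Compute P(C|A) = sum_B P(C|B) * P(B|A) by explicit summation.

    Note the three nested loops: one composition touches every (a, b, c)
    combination, so the cost is the product of the state-space sizes.
    """
    p_c_given_a = {}
    for a, p_b in p_b_given_a.items():
        dist = {}
        for b, pb in p_b.items():
            for c, pc in p_c_given_b[b].items():
                dist[c] = dist.get(c, 0.0) + pb * pc
        p_c_given_a[a] = dist
    return p_c_given_a

p_b_given_a = {"a0": {"b0": 0.9, "b1": 0.1},
               "a1": {"b0": 0.2, "b1": 0.8}}
p_c_given_b = {"b0": {"c0": 0.7, "c1": 0.3},
               "b1": {"c0": 0.4, "c1": 0.6}}

print(compose(p_b_given_a, p_c_given_b))
# {'a0': {'c0': 0.67, 'c1': 0.33}, 'a1': {'c0': 0.46, 'c1': 0.54}}
```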

Church is an attempt to solve this problem.  (Apparently it's a practical attempt, because the founders have already started a company, Navia Systems, using this structure to build probabilistic computers.)  The idea is, instead of describing a probability distribution as a deterministic procedure that evaluates the probabilities of different events, to represent it as a probabilistic procedure for generating samples from it.  That is, a random variable in the program is actually a random variable.  This means that repeating a computation will not give the same result each time, because evaluating a random variable doesn't give the same result each time.  The computational advantage is that random variables can be composed without summing over all possible values of the intermediate ones.
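Here is the same idea sketched in Python (the particular conditional probabilities are again invented): each conditional distribution is represented by a procedure that returns a single sample, and composing two of them is just ordinary function composition, with no sum over the intermediate variable.

```python
import random

# A sketch of the sampling view; the probabilities and function names are invented.

def sample_b_given_a(a):
    # B is a noisy copy of A (values 0 or 1)
    return a if random.random() < 0.9 else 1 - a

def sample_c_given_b(b):
    # C is a noisier copy of B
    return b if random.random() < 0.7 else 1 - b

def sample_c_given_a(a):
    # Composing samplers is ordinary function composition: draw B, then draw C.
    # No summation over the possible values of B is ever performed.
    return sample_c_given_b(sample_b_given_a(a))

# Repeated evaluation gives different results: the procedure is the random variable.
samples = [sample_c_given_a(0) for _ in range(10000)]
print(sum(samples) / len(samples))  # estimates P(C=1 | A=0), about 0.34 here
```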

Church is based on Lisp. At the lowest level, it replaces Boolean gates with stochastic digital circuits.  These circuits are wired together to form Markov chains (the probabilistic counterpart of finite state machines).  At the top level, it's possible to define probabilistic procedures for generating samples from recursively defined distributions.
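As a small illustration of what a recursively defined distribution looks like, here is a geometric distribution written as a recursive sampling procedure. This is sketched in Python rather than in Church's Scheme syntax, and the helper name `flip` is just the conventional name for a biased coin, so treat it as a paraphrase rather than actual Church code.

```python
import random

def flip(p=0.5):
    """An elementary stochastic primitive: True with probability p."""
    return random.random() < p

def geometric(p):
    """A recursively defined distribution: failures before the first success.

    Running the procedure draws one sample; the recursion gives the distribution
    unbounded support without writing down any explicit probability table.
    """
    return 0 if flip(p) else 1 + geometric(p)

print([geometric(0.5) for _ in range(10)])
```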


When I saw this paper, I thought it might make an interesting top-level post -- unfortunately, I'm not the one to do it.  I don't know enough computer science; it's been ages since I've touched Lisp, and lambda calculus is new to me.  So this is an open call for volunteers -- any brave Bayesians want to blog about a brand new computer language?

(As an aside, I think we need more technical posts so we don't spend all our time hissing at each other; how would people feel about seeing summaries of recent research in the neuro/cognitive science/AI-ish cluster?)

22 comments

The key idea behind Church and similar languages is that they allow us to express and formally reason about a large class of probabilistic models, many of which cannot be formalized in any concise way as Bayes nets.

Bayes nets express generative models, i.e. processes that generate data. To infer the states of hidden variables from observations, you condition the Bayes net and compute a distribution on the hidden variable settings using Bayesian inference or some approximation thereof. A particularly popular class of approximations is the class of sampling algorithms, e.g. Markov Chain Monte Carlo methods (MCMC) and importance sampling.
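As a miniature (and invented) example of the sampling approach to conditioning, here is likelihood weighting, which is importance sampling in its simplest form, applied to a two-node net: sample the hidden variable from its prior, and weight each sample by the likelihood of the observed data.

```python
import random

# An invented two-node example: Rain -> WetGrass, conditioned on WetGrass = True
# by likelihood weighting (importance sampling with the prior as the proposal).

P_RAIN = 0.2                               # prior: P(Rain = True)
P_WET_GIVEN = {True: 0.9, False: 0.1}      # likelihood: P(WetGrass = True | Rain)

def posterior_rain_given_wet(n=100_000):
    weighted_rain, total_weight = 0.0, 0.0
    for _ in range(n):
        rain = random.random() < P_RAIN    # sample the hidden variable from its prior
        w = P_WET_GIVEN[rain]              # weight by the likelihood of the observation
        weighted_rain += w * rain
        total_weight += w
    return weighted_rain / total_weight    # estimate of P(Rain = True | WetGrass = True)

print(posterior_rain_given_wet())
# Exact answer: 0.2*0.9 / (0.2*0.9 + 0.8*0.1) ~ 0.692
```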

Probabilistic programs express a larger class of models, but very similar approximate inference algorithms can be used to condition a program on observations and to infer the states of hidden variables. In both machine learning and cognitive science, when you are doing Bayesian inference with some model that expresses your prior belief, you usually code both model and inference algorithm and make use of problem-specific approximations to Bayesian inference. Probabilistic programs separate model from inference by using universal inference algorithms.
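The crudest universal inference algorithm is rejection sampling: run the generative program repeatedly and keep only the runs whose output matches the observations. The sketch below is illustrative Python, not Church's actual machinery, and the coin-weight model is invented for the example; the point is that `rejection_query` knows nothing about the particular model it is conditioning.

```python
import random

def model():
    """Generative model: pick a coin weight, then flip the coin five times."""
    weight = random.choice([0.3, 0.5, 0.7])                 # hidden variable (prior)
    flips = [random.random() < weight for _ in range(5)]    # observable variables
    return weight, flips

def rejection_query(model, condition, n=100_000):
    """Universal inference: works for any model written as a sampler.

    Keep the hidden state of every run whose output satisfies the condition;
    the surviving samples are distributed according to the posterior.
    """
    accepted = []
    for _ in range(n):
        hidden, observed = model()
        if condition(observed):
            accepted.append(hidden)
    return accepted

# Condition on having seen at least 4 heads, then read off P(weight | data).
posterior = rejection_query(model, lambda flips: sum(flips) >= 4)
for w in (0.3, 0.5, 0.7):
    print(w, posterior.count(w) / len(posterior))
```

Real systems replace the rejection loop with MCMC or importance sampling, but the separation is the same: the model is an arbitrary sampler, and the inference procedure is generic.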

If you are interested in this set of ideas in the context of cognitive science, I recommend this interactive tutorial.

Church is based on Lisp. At the lowest level, it replaces Boolean gates with stochastic digital circuits. These circuits are wired together to form Markov chains (the probabilistic counterpart of finite state machines). At the top level, it's possible to define probabilistic procedures for generating samples from recursively defined distributions.

This confuses Church as a language for expressing generative models with ideas on how to implement such a language in hardware. There are three different ideas here:

  • Church as a language for generative models with sampling-based semantics
  • MCMC as an approximate inference method for such models (that can be implemented on traditional von Neumann architectures)
  • Machine architectures that are well-suited for MCMC

So this is an open call for volunteers -- any brave Bayesians want to blog about a brand new computer language?

I'll write an exposition in the next few weeks if people are interested.

[anonymous]

Thank you so much. I'm sorry I was confused, and I'm glad someone is around who knows more. And thank you so much for the tutorial! This is proving to be much clearer than the papers. You are a prince.

Cyan

Do eet.

you usually code both model and inference algorithm and make use of problem-specific approximations to Bayesian inference. Probabilistic programs separate model from inference by using universal inference algorithms.

Naively, separation of model from inference algorithm seems like a terrible idea. People use problem-specific approximation algorithms because if they don't exploit specific features of the model, inference will be completely intractable. Often this consideration is so important that people will use obviously sub-optimal models, if those models support fast inference.

Yes, deriving mechanisms that take complex models and turn them into something tractable is mostly an open problem.

Please do.

(As an aside, I think we need more technical posts so we don't spend all our time hissing at each other; how would people feel about seeing summaries of recent research in the neuro/cognitive science/AI-ish cluster?)

Upvoted for this alone; I've always felt one of the things that kept LW good was that the subject matter and terminology acted as a filter for participation.

Are the inference methods really advanced enough to make this kind of thing practical? I know that in statistics, MCMC methods have long been specialized enough that in order to use them for even relatively simple problems (let's say a 20-parameter linear model with non-normal errors) you have to know what you're doing. I have been interested in this field because recently there have been promising algorithms that make inference much easier for problems with continuous variables (in particular, gradient/Hessian-based algorithms analogous to Newton methods in optimization). Probabilistic programs seem like a much more difficult category to work with, likely to get very slow very fast. Am I missing something? What do people hope to accomplish with such a language?

Current universal inference methods are very limited, so the main advantages of using probabilistic programming languages are (1) the conceptual clarity you get by separating generative model and inference and (2) the ability to write down complex nonparametric models and immediately be able to do inference, even if it's inefficient. Writing a full model+inference implementation in Matlab, say, takes you much longer, is more confusing and less flexible.

That said, some techniques that were developed for particular classes of problems have a useful analog in the setting of programs. The gradient-based methods you mention have been generalized to work on any probabilistic program with continuous parameters.

Interesting, I suppose that does seem somewhat useful, for discussion purposes at the very least. I am curious about how a gradient-based method can work without continuous parameters: that is counterintuitive to me. Can you throw out some keywords? Keywords for what I was talking about: Metropolis-adjusted Langevin algorithm (MALA), Stochastic Newton, any MCMC with 'Hessian' in the name.

They don't work without continuous parameters. If you have a probabilistic program that includes both discrete and continuous parameters, you can use gradient methods to generate MH proposals for your continuous parameters. I don't think there are any publications that discuss this yet.
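For intuition, here is a minimal sketch of a gradient-informed proposal inside Metropolis-Hastings for a single continuous parameter, in the spirit of MALA. The target density, step size, and finite-difference gradient are illustrative choices, not anything from a Church implementation.

```python
import math, random

def log_p(x):
    return -0.5 * (x - 3.0) ** 2                     # unnormalized log target: N(3, 1)

def grad_log_p(x, h=1e-5):
    return (log_p(x + h) - log_p(x - h)) / (2 * h)   # finite-difference gradient

def log_q(x_new, x_old, eps):
    """Log density (up to a constant) of the Langevin proposal
    x_new ~ Normal(x_old + eps^2/2 * grad_log_p(x_old), eps^2)."""
    mean = x_old + 0.5 * eps ** 2 * grad_log_p(x_old)
    return -0.5 * ((x_new - mean) / eps) ** 2

def mala_step(x, eps=0.5):
    # Drift toward higher probability, plus Gaussian noise.
    proposal = x + 0.5 * eps ** 2 * grad_log_p(x) + eps * random.gauss(0, 1)
    # The MH acceptance ratio corrects for the asymmetric, gradient-informed proposal.
    log_alpha = (log_p(proposal) + log_q(x, proposal, eps)
                 - log_p(x) - log_q(proposal, x, eps))
    return proposal if math.log(random.random()) < log_alpha else x

x, samples = 0.0, []
for _ in range(20000):
    x = mala_step(x)
    samples.append(x)
print(sum(samples) / len(samples))                   # should be close to 3.0
```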

Oh, OK, that makes perfect sense. Breaking inference problems into subproblems and using different methods on the subproblems seems like a common technique.

Writing a full model+inference implementation in Matlab, say, takes you much longer, is more confusing and less flexible.

Defining and implementing a whole programming language is way more work than writing a library in an existing language. A library, after all, is a language, but one you don't have to write a parser, interpreter, or compiler for.

I was comparing the two choices people face who want to do inference in nontrivial models. You can either write the model in an existing probabilistic programming language and get inefficient inference for free or you can write model+inference in something like Matlab. Here, you may be able to use libraries if your model is similar enough to existing models, but for many interesting models, this is not the case.

Ok, I was comparing the effort at the meta-level: providing tools for computing in a given subject area either by making a new language, or by making a new library in an existing language.

It looks like there are a number of other probabilistic languages out there.

Including lambda circle (the name is actually written as a symbol), which is associated with Sebastian Thrun and some real-world robotics. It's an OCaml variant.

There is also another OCaml variant called IBAL, which doesn't seem to use the sampling method the other two are talking about.

LtU has some discussion on them.

I'm also not a Lisp speaker, but I have a passing knowledge of OCaml syntax, so I might have a look at lambda circle and blog on it, once I figure out how to Google for it.

Including lambda circle (the name is actually written as a symbol), which is associated with Sebastian Thrun and some real-world robotics. It's an OCaml variant.

I was reminded of Sebastian Thrun's CES that I heard about when reading EY's Selling Non-Apples. Anyone know the differences between them?

I was really interested in CES when I read about it because of the idea of being able to seamlessly combine multiple sources of information about the same phenomenon, which is what you would need for a real-world agent that can perform general reasoning.

(I was especially interested when I saw the scene in The Dark Knight where they convert the sonar data from a bunch of cell phones into an image of a city, and got to thinking about how you would implement something like that, to the extent that it's realistic, but please don't hold that against me ... )

Regarding the parenthetical, summaries of that sort would be good.

I'm in general pessimistic that we're ever going to have a really good approach to handling these sorts of probability calculations, because some versions of them are hard. For example, SAT, which is NP-complete, can be sort of thought of as trying to find a probability distribution where each event is assigned a probability of 0 or 1. This doesn't resemble anything rigorous, it's more just a general intuition. However, the fact that they think Church might be used for practical applications does make this look good. Also, I may be too focused on a computational complexity approach to things rather than a more general "this makes this easier to program and understand" approach.

Probabilistic inference in general is NP-hard, but it is not clear that (1) this property holds for the kinds of problems people are interested in and, even if it does, that (2) approximate probabilistic inference is hard for this class of problems. For example, if you believe this paper, probabilistic inference without extreme conditional probabilities is easy.

[anonymous]

(As an aside, I think we need more technical posts so we don't spend all our time hissing at each other; how would people feel about seeing summaries of recent research in the neuro/cognitive science/AI-ish cluster?)

Yes to more technical summary posts, 'cuz I think they could be useful for exposing people doing IA or AI research to ideas that might be otherwise missed in the information sea.

To the aside: I'd find it interesting at the least, and I'd hope that some of the research would come with information that could actually be put into use. Eliezer's sequences for the site were great because they built solidly upon research.

If anyone is interested, I'm doing an MSc project using Church; see here. Does anybody know of an intuitive explanation of Metropolis-Hastings? The literature is replete with the assumption that the reader is a statistician or a professional mathematician, and I haven't got the time to catch up.
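For whatever it's worth, the core of Metropolis-Hastings is small once stripped down: propose a local move, always accept moves to higher probability, and accept moves to lower probability with probability equal to the ratio of the densities. Only an unnormalized density is needed, which is the whole point for Bayesian posteriors. Here is a toy random-walk version in Python; the standard-normal target is just an example, and with a symmetric proposal the correction term in the acceptance ratio cancels.

```python
import math, random

def log_p(x):
    return -0.5 * x ** 2        # unnormalized log density: standard normal

def metropolis(n_samples, step=1.0):
    """Random-walk Metropolis: wander around, biased toward high-probability regions."""
    x, samples = 0.0, []
    for _ in range(n_samples):
        proposal = x + random.gauss(0, step)   # symmetric random-walk proposal
        # Accept with probability min(1, p(proposal) / p(x)).
        if math.log(random.random()) < log_p(proposal) - log_p(x):
            x = proposal                       # otherwise stay where we are
        samples.append(x)
    return samples

samples = metropolis(50_000)
print(sum(samples) / len(samples))             # should be close to 0.0
```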