Whpearson recently mentioned that people in some other online communities frequently ask "what are you working on?". I personally love asking and answering this question. I made sure to ask it at the Seattle meetup. However, I don't often see it asked here in the comments, so I will ask it:
What are you working on?
Here are some guidelines:
- Focus on projects that you have recently made progress on, not projects that you're thinking about doing but haven't started; those are for a different thread.
- Why this project and not others? Mention why you're doing the project and/or why others should contribute to it (if applicable).
- Talk about your goals for the project.
- Any kind of project is fair game: personal improvement, research project, art project, whatever.
- Link to your work if it's linkable.
Making Bayesian statistics easier and more accessible by coding advanced sampling algorithms for PyMC
Some background: I took statistics in high school because it seemed vaguely useful. Unfortunately, the material was very dry, mostly memorization with few general principles. It was boring and limited. College statistics was the same. During some internships, statistics seemed very useful for figuring things out, but I didn't know how to do very much.
Later I started reading Overcoming Bias, and Yudkowsky kept mentioning this thing called "Bayes' theorem" and how it was really powerful. I read a book on Bayesian statistics and my mind was blown. The statistics I had been taught was a collection of formulas that gave answers but not much insight; Bayes' theorem encapsulated not just all of the statistics I had learned but the very notion of "learning from data." I was hooked.
Later I figured out why complex problems are hard: the curse of dimensionality. As the number of parameters grows, computing the posterior by brute force quickly becomes intractable, even though the simple problems taught in stats classes stay easy.
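To make that concrete, here's a quick back-of-the-envelope sketch (the numbers are mine, purely for illustration): evaluating a posterior on a grid with a fixed resolution per parameter costs exponentially many evaluations as the number of parameters grows.

```python
# Naive grid evaluation of a posterior: cost is exponential in dimension.
points_per_dim = 20  # hypothetical resolution per parameter
for dim in (1, 2, 5, 10, 20):
    print(f"{dim:>2} parameters -> {points_per_dim ** dim:.1e} evaluations")
```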
My project: Bayes' theorem provides a simple, coherent framework for learning from data. It massively clarifies how to think about data, and it is something all engineers (and technical folk in general) could and should know. Not only is Bayesian stats very practical, it turns a topic that even nerds find confusing and boring into something elegant and interesting.
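To show what I mean by elegant, here is the whole idea, posterior ∝ prior × likelihood, on the simplest possible problem. This is a toy example of my own, not PyMC code:

```python
import numpy as np

# Posterior over a coin's bias after seeing 7 heads in 10 flips,
# computed on a discrete grid: posterior is proportional to prior * likelihood.
theta = np.linspace(0.01, 0.99, 99)        # candidate values of P(heads)
prior = np.ones_like(theta) / theta.size   # uniform prior
likelihood = theta**7 * (1 - theta)**3     # binomial likelihood (up to a constant)
posterior = prior * likelihood
posterior /= posterior.sum()               # normalize so it sums to 1
print(theta[posterior.argmax()])           # posterior mode: 0.7
```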
I want to make fitting Bayesian models as thought-free as possible. Calculating the posterior distributions for your models is often very difficult, and it is usually the most constraining issue. This is often true (though less so) even if you know a great deal about statistical computation.
As I have discussed here, I think the current lowest-hanging fruit is the use of gradients and higher derivatives in algorithms for sampling from the posterior distribution. Thus my project for the last year and more has been improving PyMC, a Python package for doing Bayesian inference: adding gradient information, implementing advanced general-purpose sampling algorithms from the literature, and improving PyMC's syntax to make it simpler, more intuitive, and more powerful.
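For a sense of what "adding gradient information" means, this is the kind of quantity such samplers consume: the gradient of the log posterior with respect to the parameters. A hand-derived sketch for a toy model of my own (not actual PyMC internals):

```python
import numpy as np

data = np.array([1.2, 0.7, 1.9, 1.4])  # toy observations

def logp(mu):
    """Log posterior (up to a constant) for a Normal(mu, 1) likelihood
    with a Normal(0, 10) prior on mu."""
    log_prior = -mu**2 / (2 * 10.0**2)
    log_lik = -np.sum((data - mu)**2) / 2
    return log_prior + log_lik

def dlogp(mu):
    """Hand-derived gradient of logp with respect to mu."""
    return -mu / 10.0**2 + np.sum(data - mu)
```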
On my blog I linked to a package I built with a sampler based on Langevin dynamics (which uses gradient information), but more recently I have found that Hybrid (or Hamiltonian) Monte Carlo is simpler in practice and works much better. This is my Hamiltonian MC implementation.
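For readers curious what HMC actually does: it treats the negative log posterior as a potential energy surface, gives the sampler a random momentum, and simulates frictionless motion with the leapfrog integrator, so proposals travel far while staying in high-probability regions. Here is a bare-bones sketch of the standard algorithm (my own illustration, not my actual PyMC code):

```python
import numpy as np

def hmc_step(q, logp, dlogp, step_size=0.1, n_steps=20):
    """One Hamiltonian Monte Carlo transition for a log posterior `logp`
    with gradient `dlogp`. Returns the new position."""
    p = np.random.standard_normal(q.shape)         # draw a random momentum
    start_H = -logp(q) + 0.5 * p @ p               # Hamiltonian = potential + kinetic

    q_new, p_new = q.copy(), p.copy()
    p_new = p_new + 0.5 * step_size * dlogp(q_new)     # leapfrog: half momentum step
    for _ in range(n_steps - 1):
        q_new = q_new + step_size * p_new              # full position step
        p_new = p_new + step_size * dlogp(q_new)       # full momentum step
    q_new = q_new + step_size * p_new                  # last position step
    p_new = p_new + 0.5 * step_size * dlogp(q_new)     # final half momentum step

    end_H = -logp(q_new) + 0.5 * p_new @ p_new
    accept = np.log(np.random.uniform()) < start_H - end_H  # Metropolis correction
    return q_new if accept else q

# Usage: sample a 2-D standard normal target.
logp = lambda q: -0.5 * q @ q
dlogp = lambda q: -q
q = np.zeros(2)
samples = []
for _ in range(1000):
    q = hmc_step(q, logp, dlogp)
    samples.append(q)
print(np.mean(samples, axis=0), np.std(samples, axis=0))  # ~[0 0], ~[1 1]
```

The gradient is what keeps proposals pointed along the posterior instead of wandering randomly, which is why it beats random-walk methods on all but the smallest problems.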
I am currently working on improving my HMC sampler and on making PyMC faster and easier to maintain and extend.
If you know Bayesian stats and have some programming skills, I invite you to help me improve statistical computation! Just message me!
Why Python instead of R? As far as I know, R is used much more widely among people actually doing statistics.