Notes on Occam via Solomonoff vs. hierarchical Bayes
What's the right way of encoding a bias towards simplicity in a Bayesian framework?
Intuitively, simpler theories are better all else equal. It also seems like finding a way to justify assigning higher prior probability to simpler theories is one of the more promising ways of approaching the problem of induction. In some places, Solomonoff induction (SI) seems to be considered the ideal way of encoding a bias towards simplicity. (Recall: under SI, hypotheses are programs that spit out observations, and a program gets prior probability 2^-CL, where CL is its length in language L.)
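To make the length-based prior concrete, here's a minimal Python sketch with made-up program names and lengths: it assigns each program weight 2^-length and normalizes over a small finite set. (Real SI works over all programs in a prefix-free universal language, so this is only a toy illustration, not SI itself.)

```python
# Toy illustration of a Solomonoff-style length prior. (Real SI uses a
# prefix-free universal language and an infinite space of programs; here we
# just normalize 2^-length weights over a small, made-up finite set.)

def length_prior(program_lengths):
    """Give each program weight 2^-length, then normalize over this finite set."""
    weights = {name: 2.0 ** -length for name, length in program_lengths.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical programs and their lengths (in bits) in some fixed language L.
programs = {"short_program": 10, "longer_program": 20}
print(length_prior(programs))
# The 10-bit program gets 2^10 = 1024 times the prior mass of the 20-bit one.
```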
But I find SI pretty unsatisfying on its own, and think there might be a better approach (not original to me) to getting a bias towards simpler hypotheses in a Bayesian framework.
Simplicity via hierarchical Bayes
I’m not sure to what extent we need to directly bake in a bias towards simpler hypotheses in order to reproduce our usual inductive inferences or to capture the intuition that simpler theories tend to be better. Maybe we could at least get a long way with a hierarchically-structured prior, where:
At the highest level, different theories T specify fundamental ontologies. For example, maybe the fundamental ontology of Ptolemaic astronomy was something like “The Earth is at the center of the universe, and all other bodies move along circles”.
Each theory T contains many specific, disjoint hypotheses, corresponding to particular “parameter values” for the properties of the fundamental objects. For example, Ptolemaic astronomy as a high-level theory allows for many different planetary orbits.
More complicated theories are those that contain more specific hypotheses. A more complicated theory must spread its prior mass over more hypotheses, so if prior mass is spread evenly over the high-level theories, any individual hypothesis in a complicated theory will get lower prior mass than individual hypotheses contained in simpler theories. I.e.:
Let h1, h2 be hypotheses in T1, T2 respectively.
Suppose T1 is simpler than T2. Then, generally we will have P(h1 | T1) > P(h2 | T2), because T2 has to spread out prior mass more thinly than T1.
If P(T1) = P(T2), then we have P(h1) = P(h1 | T1)*P(T1) > P(h2 | T2)*P(T2) = P(h2).
This means that we can spread out prior mass evenly over the high-level theories (rather than giving lower prior mass to the complex high-level theories), and still find that the posterior mass of complex hypotheses is lower than that of equally-well-fitting simple hypotheses.
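Here's a minimal numerical sketch of that argument in Python, with made-up numbers: the simple theory contains 10 specific hypotheses, the complex one 1,000, the two theories get equal prior mass, and mass is spread evenly within each theory.

```python
# Toy illustration of the hierarchical-prior argument above.
# All numbers are made up; only the structure of the calculation matters.

p_T1, p_T2 = 0.5, 0.5             # indifference over the two high-level theories
n_hyps_T1, n_hyps_T2 = 10, 1000   # T2 is more complicated: it contains more specific hypotheses

# Spread each theory's prior mass evenly over its own hypotheses.
p_h1 = p_T1 / n_hyps_T1           # a particular hypothesis within the simple theory T1
p_h2 = p_T2 / n_hyps_T2           # a particular hypothesis within the complex theory T2
print(p_h1, p_h2)                 # 0.05 vs 0.0005

# If h1 and h2 fit the data equally well, i.e. P(data | h1) == P(data | h2),
# the likelihoods cancel and the posterior ratio equals the prior ratio,
# so h1 ends up 100x more probable than h2.
likelihood = 0.2                  # same for both hypotheses, by assumption
posterior_ratio = (likelihood * p_h1) / (likelihood * p_h2)
print(posterior_ratio)            # 100.0
```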
Again, this way of thinking about the relationship between Bayesianism and simplicity is not original to me. See Henderson (2014) for a discussion in the philosophy of science, and Rasmussen and Ghahramani (2000) for a discussion in the context of Bayesian machine learning. Huemer (2016) and Builes (2022) apply such reasoning to argue against skeptical theories.
A problem with this view: It’s not clear how to decide what should be a high-level theory. E.g., are Copernican and Ptolemaic astronomy two high-level theories, or are they two sub-theories of the high-level theory that says planets move along circles (but doesn’t fix the behavior of the Sun or Earth)?
Intuitively, this doesn’t bother me a huge amount. Even if it ends up being underdetermined how to do this, my guess is that reasonable ways of individuating high-level theories will still constrain our inferences a lot. But maybe not; I haven’t thought about it much.
Syntax vs. ontology
SI assigns prior probabilities according to the syntax (in an arbitrary language) used to specify a theory. Setting aside the other problems for SI (e.g., see this post), I think this is pretty unsatisfactory as an attempt to capture our intuitive preference for simplicity, for a few reasons:
First of all, I’d like to avoid just specifying by fiat that simpler hypotheses get higher prior probability and instead have this be a consequence of more solid principles. I think the principle of indifference is solid, if we can find a privileged parameterization of the hypothesis space to which we can apply the principle. The approach sketched above is attractive to me in this respect: We can try to apply a principle of indifference* at the level of fundamental ontological commitments, which has the consequence that hypotheses contained in more complex theories get lower prior mass.
*Of course, if we’re considering infinitely many theories/hypotheses we’re going to run into trouble trying to use the principle of indifference. But I still think this view takes us a long way.
A commenter points out that Solomonoff induction can be seen as the application of the principle of indifference, i.e., “where you just take the uniform prior over all programs of length T, then let T go to infinity”. To be clear, my view is that the POI should be used when there is a nonarbitrary partition of the hypothesis space to which it can be applied, and this application of the POI is language-dependent. Whereas, on the hierarchical view, the hope is that the privileged parameterization to which you can apply the POI is something like “properties of the fundamental entities in the theory (e.g., positions and momenta of particles in Newtonian mechanics, maybe?)”. (See Huemer (2009) and Climenhaga (2020) on applying the POI at the “explanatorily basic” level.)
Second of all, insofar as we do want to directly penalize more complex hypotheses, syntactic simplicity does not seem like the way to go. Surely when we intuit that simple theories are better, we have in mind the simplicity of a theory’s ontology (how many entities it posits, how uniform its laws are, etc.). While the syntactic simplicity (in some natural-to-us programming language) of specifying a theory presumably correlates with the kind of simplicity we actually care about, they don’t seem to be the same thing.
So I would say: If you do want to directly assign prior probabilities to hypotheses according to their simplicity, you should start by looking at what the hypothesis actually says about the world and figure out how to measure the simplicity of that.
A possible response: Solomonoff induction is already a perfectly rigorous theory, which at least accords with many of our intuitions about epistemology. On the other hand, all this business about ontologies has yet to be formalized, and it’s far from clear that any satisfying formalism exists.
My reply: This sounds like the streetlight effect. The reason that SI has a nice formalism is that it only looks at an easily-extracted property of a hypothesis (its syntax), and doesn’t attempt to extract the thing we should directly care about: what the hypothesis actually says about the world.
Moreover, thinking in ontological terms may help make progress on one of the (IMO) serious problems for SI: the apparently arbitrary choice of language. For example, maybe in the end we'll decide that the best we can do is SI using a language that makes it easy to specify a hypothesis in terms of its ontology.
References
Builes, David. 2022. “The Ineffability of Induction.” Philosophy and Phenomenological Research 104 (1): 129–49.
Climenhaga, Nevin. 2020. “The Structure of Epistemic Probabilities.” Philosophical Studies 177 (11): 3213–42.
Henderson, Leah. 2014. “Bayesianism and Inference to the Best Explanation.” The British Journal for the Philosophy of Science 65 (4): 687–715.
Huemer, Michael. 2009. “Explanationist Aid for the Theory of Inductive Logic.” The British Journal for the Philosophy of Science 60 (2): 345–75.
———. 2016. “Serious Theories and Skeptical Theories: Why You Are Probably Not a Brain in a Vat.” Philosophical Studies 173 (4): 1031–52.
Rasmussen, Carl, and Zoubin Ghahramani. 2000. “Occam’s Razor.” Advances in Neural Information Processing Systems 13. https://proceedings.neurips.cc/paper/2000/hash/0950ca92a4dcf426067cfd2246bb5ff3-Abstract.html.