Marcus Hutter: Foundations of Induction - 2022

Details

Title : Marcus Hutter: Foundations of Induction Author(s): Machine Learning and Dynamical Systems Seminar Link(s) : https://www.youtube.com/watch?v=bn060on1hKs

Rough Notes

Motivation

Hypothesis testing/identification would ask:

Does treatment X cure cancer?
Do observations of white swans confirm that all ravens are black?

Model selection asks, given some data:

Are planetary orbits circles or ellipses?
How many wavelets do I need to describe my picture well?
Which genes predict cancer?

Parameter estimation asks:

What is the bias of my coin?
What is the eccentricity of earth's orbit?

Sequence prediction involves:

Predicting the weather/stock price/ etc. tomorrow based on past sequence.
Some IQ tests can test sequences like 1,4,9,16,_?

Classification (which can be reduced to sequence prediction) involves:

Predict whether the email is spam or not.

The question here is, Is there a general, formal, complete, consistent theory for induction and prediction

Beyond induction, there is active/reward learning, optimization, game theory etc.

The need for such a unified theory is justified by problems caused due to multiple theories being prone to disagreement/contradiction. Axiomatization helped mathematicians and it (should) also help induction [in the speaker's view]. It is also important since induction is half of science, the other half being experimental design.

Any math can be said in words, but not all words can be put into equations, since a lot of words are nonsense (Clifford A. Truesdell 1966).

The most important slide:

.	Induction	Deduction
Types of inference	Generalization/Prediction	Specialization/Derivation
Framework	Probability axioms	Logical axioms
Assumptions	Prior	Non-logical axioms
Inference rules	Bayes rule	Modus ponens
Results	Posterior	Theorems
Universal scheme	Solomonoff probability	ZF/ZFC
Universal inference	Universal induction	Universal theorem prover
Limitation	Incomputable	Incomplete
In practice	Approximations	Semi-formal proofs
Operation	Computation	Proof

Logical axioms include things like the law of excluded middle, while non-logical axioms include the Peano axioms. Approximations include the Minimum Description Length (MDL) principle.

An important point of the table is that whenever we want to criticize the Induction column, we can also criticize the Deduction column.

Critique

The critique here is meant as a critique of existing alternatives to the induction problem. The most famous is the work of Popper. The speaker argues that his ideas of falsification and corroboration are seriously flawed (e.g. see Godfrey-Smith 2003, Salmon 1981, Putnam 1974, Schlipp 1974).

Let's start with the demarcation problem: What is the difference between a scientific and a non-scientific theory?

Popper's solution is falsification, i.e. a hypothesis is scientific if and only if it can be refuted by possible observation - it is a matter of deductive logic.

Problems with Falsification:

Stochastic models can never be falsified as they can only be unlikely rather than inconsistent

with data.

It cannot prefer a well tested theory over a brand-new

untested theory [#DOUBT I guess this is like complex vs. simple models which both explain observed data?].

To tackle this latter problem, Popper does suggest we apriori must prefer "reasonable" theories over "obscure" theories, he thinks simple models should be preferred as they are easier to falsify, thus Popper equates simplicity with falsifiability. A complex theory (which can explain more data) with fixed parameters is as easy to falsify as a simple theory.

Going over some more of Popper's views:

(Fallibilism) We can never be completely certain about factual issues (Speaker agrees).
(Skepticism) Scientific confirmation is a myth (Speaker disagrees).
(No confirmation) We cannot even increase confidence in truth of a theory when it passes observational tests.
Induction is a myth, but science does not need it anyway.
(Corroboration) A theory that has survived many attempts to falsify it is "corroborated", and it is rational to choose more corroborated theories.

The speaker's problem with these views is that corroboration is now just a new name for confirmation, else the above views are pointless.

The speaker says the No Free Lunch (NFL) Theorems are a myth, e.g. consider algorithms for finding the maximum of a function, and compare their performance uniformly averaged over all functions over some fixed finite domain. The NFL theorems say that since sampling uniformly gives a totally random white noise function with very high probability, it is clear that on average no optimization algorithm can perform better than exhaustive search. Thus all reasonable optimization algorithms are equally good/bad on average, but NFL has no practical implication since nobody cares about maxima of white noise functions.

The solution is then to take a universal prior (no domain knowledge) instead of a uniform prior - universal sampling [#DOUBT I guess this means sampling from the universal prior?] is as close as having no prior knowledge as is needed for any practical purpose.

Problems with frequentism:

Probability of an event is defined as the limiting relative frequency of seeing it, \(\mathh{P}(E):=\lim_{n\to \infty}\frac{\# E_n}{n}\)
Speaker argues this definition is circular, e.g. flipping a fair coin would the chance of heads as 0.5 with probability 1 (#DOUBT I guess he is meaning convergence in probability), meaning we explain "Probability of \(E\)" in terms of "Probability 1" - so what does "Probability 1" mean. Cournot's principle can help resolve this circularity.
Limited to i.i.d. cases, the real world is not i.i.d.
Reference class problem: Counting frequencies of a target variable among similar features - there are no 2 similar samples. Machine learning with feature selection can help here.

Problems with statistical learning theory:

Deals mostly with i.i.d data, as mentioned above real life is more complex - it is non-stationary and non-ergodic.

Problems with other approaches:

Subjective Bayesians have no formal procedure to get the prior.
Objective Bayesians are right in spirit, but limit to small classes e.g. criteria like reparametrization invariance leading to Jeffrey's prior etc.
MDL/MML - practical approximations of universal induction.
Pluralism, using multiple theories - globally inconsistent.
Deductive logic is not string enough for induction.
Non-monotonic reasoning, inductive logic, default reasoning etc. do not properly quantify uncertainty.
Carnap's confirmation theory - Only for exchangeably data, cannot confirm universal hypotheses (Solomonoff's induction theory can be thought of as a big leap from this).
Machine learning/data-driven science - data maybe more important than algorithms for simple problems, but a lookup-table AGI will not work.
Eliminative induction - ignores uncertainty and information theory.

Most of these approaches work in their limited domains, we want to ask ourselves if something better exists.

Universal Induction

Timeline of foundations of universal induction:

Occam's razor.
Epicurus's principle of multiple explanations.
Bayes rule for conditional probabilities.
Turing's universal machine.
Kolmogorov complexity.
Solomonoff's universal prior (solves the question of choosing the prior if nothing is known).

Marvin Minsky stated that algorithmic probability is the most important discovery since Godel (Limits of Understanding, World Science Festival 2010).

Science ~ Induction ~ Occam's razor

Consider the Grue Emerald Paradox:

Hypothesis 1: All emaralds are green.
Hypothesis 2: All emeralds found till 2050 are green, afterwards all emeralds are blue.

Falsification does not help since both are equally falsifiable. Occam's razor helps, we take the simplest hypothesis consistent with the data. Now, how do we quantify simplicity? By description length.

(No science without Occam's razor)

The shortest program for a string on a Turing machine \(T\) leads to best extrapolation (and thus prediction) - Kolmogorov Complexity of some object \(x\) (a string, an encoded hypothesis etc) and let \(T\) be a Turing machine, then define \(K_T(x)=\min_p \{l(p):T(p)=x\}\). We can show that the description length of \(x\) with respect to a universal Turing machine \(U\) is always smaller than with respect to any other coding scheme (#DOUBT), i.e. \[ \text{Kolmogorov-complexity}(x) = K_U(x) \leq K_T(x)+c_T \]

Bayesian probability provides a way (Bayes rule) to update probability of hypotheses as we get more data - but it does not tell how to choose the prior.

Combining these, we get to Algorithmic Probability Theory. Specifically, apriori to choose a uniform prior over hypotheses \(H_i\) can be quantified in terms of Kolmogorov complexity: \[ \mathbb{P}(H_i) := w_{H_i}^U := 2^{-K_{T/U}(H_i)} \] Where we choose \(U\) for universal priors, or a specific Turing machine \(T\) for a practical system. Fixing \(T\) will give a complete theory for prediction, but choosing \(T\) is a hard problem. Choosing \(U\) (particular choice of \(U\) does not matter) makes the problem incomputable.

There are 3 different representations of Solomonoff's distribution, a universal probability distribution \(M\), where \(M(x)\) is the probability that a universal Turing machine outputs \(x\) when provided fair coin flips on the input tape. This gives a formal theory of sequential prediction. Then, the predictive probability of \(y\) given \(x\) is \(M(y|x)=\frac{M(xy)}{M(x)}\). Given \(x_1,\cdots, x_{t-1}\) the probability of \(x_t\) is \(M(x_t|x_1,\cdots,x_{t-1})\). These hold assuming absolutely no apriori knowledge about the domain, and if you have domain knowledge there are ways to incorporate it.

\(M\) has various interesting properties and prediction bounds [skipped in this note].

\(M\) solves the raven problem, where Bayes-Laplace 0 for \(\mathbb{P}(\text{All ravens black}|n \text{ black ravens})\) while for universal prior this converges to 1 fast. It also has reparametrization and regrouping invariance. There is also no risk of biasing prior towards past data since the universal prior is fixed and independent of the model class, and updating the model is not necessary since the universal class already includes all. \(M\) also predicts better than all other mixture predictors. (#DOUBT What does all of this mean).

\(K,M\) can be served as gold standards, which are approximated in practice - MDL (Ris89), MML (Wal05), LZW (LZ76), CTW (WSTT95), NCD (CV05).

Universal Artificial Intelligence

Idea is to go from induction to prediction to decision to action.

We can also formalize sequential decision theory with the ideas above.

Key idea is that the optimal action/plan/policy based on the simplest model consistent with history can be formalized as an optimization task, slides call it AIXI.

Approximations and Applications

MDL is a common approximation to Solomonoff.

Universal clustering: When is object \(x\) similar to object \(y\)? Can be formalized using K-complexity.

Levin search: Fastest algorithm for inversion and optimization problems. Has a theoretical application: if a non-constructive proof of P=NP is found, then Levin search is a polynomial time algorithm for every NP (complete) problem. Practical application: Maze, towers of Hanoi, robotics (See J. Schmidhuber).

Conclusions

Conceptually and mathematically, the problem of induction is solved.

Computationally and philosophically some questions remain.

Ingredients are Occam, Epicurus, Turing, Bayes, Kolmogorov, Solomonoff. For decisions and actions, include Bellman.

Roughly, Induction ~ Science ~ Machine learning ~ Occam's razor ~ Compression ~ Intelligence.

(Understanding is also compression).

Some advice to philosophers and scientists interested in the foundations of induction:

Accept Universal Induction (UI) as the best conceptual solution of the induction problem so far.
People who have not understood the giants (Bayes, Shannon, Turing, Kolmogorov, Solomonoff, Wallace, Rissanen, Bellman) and try to reinvent the wheel from scratch can be safely ignored.
Never trust a theory if it not supported by expertiment.
[Some more not included in this note].

It is okay to ignore UI if:

Your approach already works sufficiently well.
Your problem is simple enough e.g. i.i.d.
You do not need/care about a principled/sound solution.
You are happy to succeed by trial-and-error.

Outlook:

Use compression size as a general performance measure like how perplexity is used in speech.
Using code-length view, many approaches become comparable, can be regarded as approximations to UI.
This should lead to better compression algorithms which in turn should lead to better learning algorithms.
Address open problems in induction within the UI framework.