Learning One Abstract Bit at a Time Through Self-Invented Experiments Encoded as Neural Networks - 2022
Details
Title: Learning One Abstract Bit at a Time Through Self-Invented Experiments Encoded as Neural Networks
Author(s): Herrmann, Vincent and Kirsch, Louis and Schmidhuber, Jürgen
Link(s): http://arxiv.org/abs/2212.14374
Rough Notes
There are 2 important things in science:
- Finding answers to given questions.
- Coming up with good questions.
(#NOTE These look like they correspond to the induction and experimental design that Marcus Hutter mentioned.)
(#TODO Read the work they have done related to artificial curiosity and creativity)
Prior relevant work by the authors includes their artificial Q&A system, which was intrinsic-motivation-based and adversarial. It has 2 deep nets: the controller \(C\), whose probabilistic outputs may influence the environment, and the world model \(M\), which predicts the environment's reactions to \(C\)'s outputs. Training involves \(M\) trying to minimize its prediction error, while \(C\) tries to find sequences of output actions which maximize \(M\)'s error - \(C\)'s "intrinsic reward" is proportional to \(M\)'s prediction errors. Generative Adversarial Networks (GANs) fall under this formalism, where the environment returns 1 or 0 depending on whether or not the controller's output is in some set.
\(C\)'s action sequence can be thought of as \(C\) asking questions: "What happens if I do that?" etc. \(M\) is learning to answer those questions. \(C\) is motivated to come up with questions whose answers \(M\) does not yet know, and loses interest in questions with known answers.
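A minimal sketch of this adversarial curiosity signal, assuming hypothetical `world_model` and environment interfaces (not the authors' code): \(M\) is trained to reduce the prediction error below, while \(C\) is rewarded in proportion to it.

```python
import numpy as np

def intrinsic_reward(world_model, obs, action, next_obs):
    """C's curiosity reward: proportional to M's prediction error.

    `world_model.predict` is an assumed interface; M is trained to minimize
    this error, while C's policy is trained to maximize it.
    """
    predicted_next = world_model.predict(obs, action)  # M's guess of the environment's reaction
    return float(np.mean((predicted_next - next_obs) ** 2))
```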
This paper seems to address issues that exploration strategies designed for deterministic environments (see references 39, 43) face in stochastic settings: such strategies may make \(C\) learn to focus on parts of the environment where \(M\) always gets high prediction errors due to randomness or to \(M\)'s computational limitations - e.g. \(C\) may get stuck in front of a TV screen showing white noise, which is quite unpredictable. Hence, in stochastic settings, \(C\)'s reward should not be \(M\)'s errors but (an approximation of) the first derivative of \(M\)'s errors across subsequent training iterations, i.e. \(M\)'s learning progress or improvement. In the white-noise TV case, \(C\) will not get stuck since \(M\)'s errors will not improve - both the totally predictable and the fundamentally unpredictable become boring.
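A sketch of this learning-progress reward, assuming \(M\) exposes a per-sample error and an update step (hypothetical interface names): the reward is the improvement of \(M\)'s error on the new experience, approximating the first derivative of the error across training iterations.

```python
def learning_progress_reward(world_model, obs, action, next_obs, update_fn):
    """Reward C with M's improvement, not its raw error (hypothetical interfaces)."""
    error_before = world_model.prediction_error(obs, action, next_obs)
    update_fn(world_model, obs, action, next_obs)   # one training step of M on the new experience
    error_after = world_model.prediction_error(obs, action, next_obs)
    # White-noise observations stay unpredictable, so error_before - error_after ~ 0:
    # neither the fully predictable nor the fundamentally unpredictable stays rewarding.
    return error_before - error_after
```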
One approach involves \(M\) learning to predict the probabilities of the environment's possible responses given \(C\)'s actions. After each interaction with the environment, \(C\)'s intrinsic reward is the KL-divergence between \(M\)'s estimated probability distributions before and after the resulting new experience - the information gain, also called Bayesian surprise.
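For a discrete response space, this information-gain reward can be sketched as the KL divergence between \(M\)'s response distribution after and before the update (a minimal numpy version; taking KL(posterior || prior) follows the usual Bayesian-surprise convention and is an assumption here).

```python
import numpy as np

def information_gain(p_before, p_after, eps=1e-12):
    """KL(p_after || p_before) over M's estimated response probabilities."""
    p_before = np.clip(np.asarray(p_before, dtype=float), eps, 1.0)
    p_after = np.clip(np.asarray(p_after, dtype=float), eps, 1.0)
    return float(np.sum(p_after * np.log(p_after / p_before)))
```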
In partially observable environments, \(C\) and \(M\) may benefit from a memory of previous events - which can be implemented via LSTMs, for example.
#TODO Write up from here on.
Quick summary of the idea: self-invented "experiments" in a reinforcement-providing environment lead to effective exploration.
We have:
- Controller \(C\) that designs experiments, with yes/no outcomes. Experiments may run for several timesteps. \(C\) will prefer simple experiments whose outcomes still surprise \(M\).
- Input: Sensory input vector \(in(t)\), external reward \(R_e(t)\), internal reward \(R_i(t)\).
- Output: a START unit; when it is active (>0.5), \(C\) uses a set of extra output units to produce the weight matrix (/program) \(\theta\) of a separate RNN called the experiment \(E\).
- Experiment \(E\): inputs are the sensory inputs \(in(t)\); outputs are actions, plus HALT and RESULT units.
- World model \(M\) which predicts environment's reactions to \(C\)'s outputs.
At time \(t'\), \(\theta\) is generated, defining \(E\), which then interacts with the environment until its HALT unit becomes active (>0.5) at a time we call \(t''\). The experimental outcome is \(r(t'')=1\) if RESULT is active, else 0.
At time \(t'\), \(M\) has to compute its output \(pr(t') \in [0,1]\) from \(\theta\) (and \(C\)'s history until \(t'\)).
\(M\)'s prediction \(pr(t')\) is compared with \(r(t'')\), and \(C\) receives an intrinsic curiosity reward proportional to \(M\)'s surprise. \(M\) is trained by GD (w/ regularization) to improve all its predictions so far.
Through the weight matrix \(\theta\), \(E\) should:
- Initialize the experiment (i.e. reset environment, move agent to starting position etc).
- Run the experiment by executing the action sequence, and compute YES/NO based on the sensory input.
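Putting these pieces together, here is a compact sketch of one experiment cycle. The interfaces (`C.propose`, `M.predict_outcome`, `M.train_on`, `RNNExperiment`) are illustrative assumptions rather than the authors' implementation, and binary cross-entropy is used as one possible surprise measure.

```python
import numpy as np

def run_self_invented_experiment(C, M, env, history, max_steps=1000):
    # At time t': C decides whether to start an experiment and emits theta.
    start, theta = C.propose(history)
    if start <= 0.5:
        return None                           # START unit inactive: no experiment this time

    # M predicts the binary outcome pr(t') in [0, 1] from theta (and C's history).
    pr = M.predict_outcome(theta, history)

    # The experiment E is an RNN whose weights are theta. In the paper E itself
    # initializes the experiment through its actions; resetting the environment
    # here is a simplifying assumption.
    E = RNNExperiment(theta)
    obs = env.reset()
    result = 0.0
    for _ in range(max_steps):                # run until the HALT unit exceeds 0.5 (time t'')
        action, halt, result = E.step(obs)
        obs = env.step(action)
        if halt > 0.5:
            break
    r = 1.0 if result > 0.5 else 0.0          # binary outcome r(t'') read off the RESULT unit

    # C's intrinsic reward is proportional to M's surprise; binary cross-entropy
    # is one natural choice (an assumption, not necessarily the paper's measure).
    surprise = -(r * np.log(pr + 1e-12) + (1.0 - r) * np.log(1.0 - pr + 1e-12))
    M.train_on(theta, history, r)             # M: gradient descent over all predictions so far
    return surprise
```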
Experimental setup:
Electromagnetic field: The agent navigates through a 2D environment with a fixed external force field - the states are the agent's position and velocity. Actions are real-valued force vectors the agent applies to itself. There is a large sparse reward at the goal state and a small negative reward per timestep.
Generated experiments are of the form \(E_\psi(s)=(a,\hat{r})\), mapping a state to an (action, reward) pair. There is no HALT unit; instead experiments run up to some time limit \(\tau\).
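A minimal sketch of such a fixed-horizon experiment rollout; reading the binary outcome from the final \(\hat{r}\) value is an assumption made for illustration, not necessarily the paper's procedure.

```python
def run_fixed_horizon_experiment(E_psi, env, tau):
    """Run an experiment E_psi(s) = (a, r_hat) for tau steps (no HALT unit)."""
    s = env.reset()
    r_hat = 0.0
    for _ in range(tau):
        a, r_hat = E_psi(s)   # experiment outputs an action and a reward/result estimate
        s = env.step(a)       # real-valued force vector applied to the agent
    return 1.0 if r_hat > 0.5 else 0.0   # assumed way of reading off the yes/no outcome
```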
- Thought experiments: (No environment interactions).