The POMDP we consider, represented as a diagram. States are marked in red, solid arrows represent actions, and dotted arrows represent state transitions (with their corresponding probabilities). Dotted arrows for deterministic transitions are omitted. $x_1, y_1, x_2, y_2, x_3$, and $y_3$ are all terminal states.
We wish to generalize the original power-seeking theorem to partially observable settings. Here we present an instance of a POMDP with symmetry in which POWER-seeking follows from Theorem A.13 of an unpublished draft by Alex Turner. The key observation is that the optimal agent in the belief MDP, defined below, optimizes an EU-determined function: essentially a function of dot products between a reward function and linear combinations of the states of the POMDP.
Let us consider the POMDP pictured above, defined by the following components:
The state set is $\mathcal{S} =\{\text{☆}, s_1, x_1, y_1, x_2, y_2, s_2, x_3, y_3\}$. Fixing that order, we will represent states as one-hot vectors in $\mathbb{R}^9$, e.g. $s_1 = [0, 1, 0, 0, 0, 0, 0, 0, 0]$.
The action set $\mathcal{A} = \{\text{left}, \text{right} \}$.
A state transition distribution $T: \mathcal{S}\times\mathcal{A} \to \Delta\mathcal{S}$. For convenience, we will represent $T$ as two $9\times9$ column-stochastic matrices, one for each action: $T = (T_\text{left}, T_\text{right})$, where entry $(i,j)$ of $T_a$ is the probability of moving to the $i$-th state from the $j$-th state under action $a$. (These matrices are encoded in the code sketch following this definition.)
$T_\text{left}=\begin{bmatrix}0&0&0&0&0&0&0&0&0\\1&0&0&0&0&0&0&0&0\\0&0.5&1&0&0&0&0&0&0\\0&0.5&0&1&0&0&0&0&0\\0&0&0&0&1&0&0&0&0\\0&0&0&0&0&1&0&0&0\\0&0&0&0&0&0&0&0&0\\0&0&0&0&0&0&0.5&1&0\\0&0&0&0&0&0&0.5&0&1\end{bmatrix}$
$T_\text{right}=\begin{bmatrix}0&0&0&0&0&0&0&0&0\\0&0&0&0&0&0&0&0&0\\0&0&1&0&0&0&0&0&0\\0&0&0&1&0&0&0&0&0\\0&0.5&0&0&1&0&0&0&0\\0&0.5&0&0&0&1&0&0&0\\1&0&0&0&0&0&0&0&0\\0&0&0&0&0&0&0.5&1&0\\0&0&0&0&0&0&0.5&0&1\end{bmatrix}$
Observations: a set of observations $\Omega$ that the agent can receive, one each step. This POMDP is entirely unobservable, which we model by setting $\vert\Omega\vert=1$.
Observation model: a conditional probability distribution $O:\mathcal{S}\to \Delta\Omega$. Since $\vert\Omega\vert=1$, every state produces the same single observation, so observations carry no information.
A discount factor $\gamma \in (0,1)$.
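To make the definition concrete, here is a minimal sketch of this POMDP in numpy. The state ordering matches the one above; the names `STATES` and `one_hot` and the particular value of `gamma` are illustrative choices rather than part of the formal definition.

```python
import numpy as np

# State order: ☆, s1, x1, y1, x2, y2, s2, x3, y3 (one-hot vectors in R^9).
STATES = ["star", "s1", "x1", "y1", "x2", "y2", "s2", "x3", "y3"]

def one_hot(name):
    """Return the one-hot vector for a named state, e.g. one_hot('s1')."""
    e = np.zeros(len(STATES))
    e[STATES.index(name)] = 1.0
    return e

# Column-stochastic transition matrices: entry (i, j) is P(next = i | current = j).
T_left = np.array([
    [0, 0,   0, 0, 0, 0, 0,   0, 0],
    [1, 0,   0, 0, 0, 0, 0,   0, 0],
    [0, 0.5, 1, 0, 0, 0, 0,   0, 0],
    [0, 0.5, 0, 1, 0, 0, 0,   0, 0],
    [0, 0,   0, 0, 1, 0, 0,   0, 0],
    [0, 0,   0, 0, 0, 1, 0,   0, 0],
    [0, 0,   0, 0, 0, 0, 0,   0, 0],
    [0, 0,   0, 0, 0, 0, 0.5, 1, 0],
    [0, 0,   0, 0, 0, 0, 0.5, 0, 1],
])
T_right = np.array([
    [0, 0,   0, 0, 0, 0, 0,   0, 0],
    [0, 0,   0, 0, 0, 0, 0,   0, 0],
    [0, 0,   1, 0, 0, 0, 0,   0, 0],
    [0, 0,   0, 1, 0, 0, 0,   0, 0],
    [0, 0.5, 0, 0, 1, 0, 0,   0, 0],
    [0, 0.5, 0, 0, 0, 1, 0,   0, 0],
    [1, 0,   0, 0, 0, 0, 0,   0, 0],
    [0, 0,   0, 0, 0, 0, 0.5, 1, 0],
    [0, 0,   0, 0, 0, 0, 0.5, 0, 1],
])
T = {"left": T_left, "right": T_right}
gamma = 0.9  # any value in (0, 1) works for this example

# Sanity check: every column of each matrix is a probability distribution.
for a, Ta in T.items():
    assert np.allclose(Ta.sum(axis=0), 1.0), f"T_{a} is not column-stochastic"
```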
A Bayesian agent solves a POMDP by forming beliefs and acting upon them; those beliefs play the role of states in an MDP. Therefore, each POMDP can be converted to a particular MDP, called a belief MDP, whose states are defined in terms of beliefs about the original, hidden states. The belief MDP associated with our original POMDP is pictured below.
The belief MDP associated with our original POMDP. States (beliefs) are marked in orange. Note that the stochastic state transitions of the original POMDP are replaced by deterministic belief transitions.
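Because the POMDP is entirely unobservable, the Bayesian belief update has no observation to condition on: it reduces to pushing the current belief through the chosen action's transition matrix. Below is a minimal sketch, continuing the numpy encoding above and assuming the agent starts with a point-mass belief on ☆; the helper name `belief_step` is hypothetical.

```python
def belief_step(belief, action):
    """One transition of the belief MDP. With |Omega| = 1 there is no
    observation to condition on, so the Bayes update reduces to pushing
    the belief through the (column-stochastic) transition matrix."""
    return T[action] @ belief

# Beliefs reachable from a point-mass belief on ☆.
b_star   = one_hot("star")
b_s1     = belief_step(b_star, "left")    # point mass on s1
b_s2     = belief_step(b_star, "right")   # point mass on s2
b_x1y1   = belief_step(b_s1, "left")      # 0.5 x1 + 0.5 y1
b_x2y2   = belief_step(b_s1, "right")     # 0.5 x2 + 0.5 y2
b_x3y3   = belief_step(b_s2, "left")      # 0.5 x3 + 0.5 y3 (identical for "right")
# Every belief transition above is deterministic, matching the diagram.
```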
More formally, our belief MDP is defined by:
Note that belief MDPs, being true MDPs, have no observations or observation model. Intuitively, this is because the agent has perfect knowledge of its belief state.
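To see why the optimal belief-MDP agent optimizes an EU-determined function, as claimed at the start of this section, consider a reward function $r \in \mathbb{R}^9$ viewed as a vector over states. Assuming, for concreteness, that the agent collects $r(s)$ in whichever state it occupies at each time step (the terminal states are absorbing, as encoded by the self-loops in $T$), the optimal values of the point-mass beliefs are

$V^*_r(s_1) = r(s_1) + \frac{\gamma}{1-\gamma}\,\max\Big\{\big\langle r, \tfrac{1}{2}(x_1 + y_1)\big\rangle,\ \big\langle r, \tfrac{1}{2}(x_2 + y_2)\big\rangle\Big\}$

$V^*_r(s_2) = r(s_2) + \frac{\gamma}{1-\gamma}\,\big\langle r, \tfrac{1}{2}(x_3 + y_3)\big\rangle, \qquad V^*_r(\text{☆}) = r(\text{☆}) + \gamma\,\max\big\{V^*_r(s_1),\ V^*_r(s_2)\big\}$

Every quantity the agent compares is a dot product between $r$ and a linear combination of one-hot state vectors (the terminal beliefs), which is the EU-determined structure referred to above.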