POMDP definition

The POMDP we consider, represented as a diagram. States are marked in red, solid arrows represent actions, and dotted arrows represent state transitions (with their corresponding probabilities). Dotted arrows for deterministic transitions are omitted. $x_1, y_1, x_2, y_2, x_3$ and $y_3$ are all terminal states.


We wish to generalize the original power-seeking theorem to partially observable settings. Here we present an instance of a POMDP with symmetry in which POWER-seeking follows from Theorem A.13 of an unpublished draft by Alex Turner. The key is that the optimal agent in the belief MDP, defined below, optimizes an EU-determined function: essentially, a function of dot products between a reward function and linear combinations of the states of the POMDP.
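
As a concrete instance of such a dot product (just the bookkeeping, not a restatement of the theorem): writing a reward function as a vector $\mathbf{r} \in \mathbb{R}^9$ that assigns a reward to each hidden state, the expected reward under a belief $b$ is

$$\mathbb{E}_{s \sim b}[R(s)] = \sum_{s} b(s)\,R(s) = \mathbf{r} \cdot b,$$

a dot product between the reward vector and a convex (hence linear) combination of the one-hot state vectors. Value functions in the belief MDP are discounted sums of terms of this form.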

Let us consider the POMDP pictured above (a short code sketch of these objects follows the list):

  1. The state set is $\mathcal{S} =\{\text{☆}, s_1, x_1, y_1, x_2, y_2, s_2, x_3, y_3\}$. Fixing that order, we will represent states as one-hot vectors in $\mathbb{R}^9$, e.g. $s_1 = [0, 1, 0, 0, 0, 0, 0, 0, 0]$.

  2. The action set $\mathcal{A} = \{\text{left}, \text{right} \}$.

  3. A state transition distribution $T: \mathcal{S}\times\mathcal{A} \to \Delta\mathcal{S}$. For convenience, we will represent $T$ as two $9\times9$ matrices, one for each action: $T = (T_\text{left}, T_\text{right})$. The entry in row $i$, column $j$ of each matrix is the probability of transitioning to the $i$-th state from the $j$-th state.

    $T_\text{left}=\begin{bmatrix}0&0&0&0&0&0&0&0&0\\1&0&0&0&0&0&0&0&0\\0&0.5&1&0&0&0&0&0&0\\0&0.5&0&1&0&0&0&0&0\\0&0&0&0&1&0&0&0&0\\0&0&0&0&0&1&0&0&0\\0&0&0&0&0&0&0&0&0\\0&0&0&0&0&0&0.5&1&0\\0&0&0&0&0&0&0.5&0&1\end{bmatrix}$

    $T_\text{right}=\begin{bmatrix}0&0&0&0&0&0&0&0&0\\0&0&0&0&0&0&0&0&0\\0&0&1&0&0&0&0&0&0\\0&0&0&1&0&0&0&0&0\\0&0.5&0&0&1&0&0&0&0\\0&0.5&0&0&0&1&0&0&0\\1&0&0&0&0&0&0&0&0\\0&0&0&0&0&0&0.5&1&0\\0&0&0&0&0&0&0.5&0&1\end{bmatrix}$

  4. Observations: a set of observations $\Omega$ that the agent can receive, one each step. This POMDP is entirely unobservable, which we model by setting $\vert\Omega\vert=1$.

  5. Observation model: a conditional probability distribution $O:\mathcal{S}\to \Delta\Omega$. Since $\vert\Omega\vert=1$, there is only one possible observation, so the observation model is trivial.

  6. A discount factor $\gamma \in (0,1)$.
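
The definition above can be written down directly in code. Below is a minimal NumPy sketch of the POMDP data (the state ordering, both column-stochastic transition matrices, and a discount factor); the variable names are ours, and the value chosen for $\gamma$ is an arbitrary pick from $(0,1)$.

```python
import numpy as np

# State ordering used for the one-hot representation:
# [star, s1, x1, y1, x2, y2, s2, x3, y3]
STATES = ["star", "s1", "x1", "y1", "x2", "y2", "s2", "x3", "y3"]

# Column-stochastic transition matrices: entry [i, j] is the probability
# of landing in state i after taking the action in state j.
T_left = np.array([
    [0, 0.0, 0, 0, 0, 0, 0.0, 0, 0],
    [1, 0.0, 0, 0, 0, 0, 0.0, 0, 0],
    [0, 0.5, 1, 0, 0, 0, 0.0, 0, 0],
    [0, 0.5, 0, 1, 0, 0, 0.0, 0, 0],
    [0, 0.0, 0, 0, 1, 0, 0.0, 0, 0],
    [0, 0.0, 0, 0, 0, 1, 0.0, 0, 0],
    [0, 0.0, 0, 0, 0, 0, 0.0, 0, 0],
    [0, 0.0, 0, 0, 0, 0, 0.5, 1, 0],
    [0, 0.0, 0, 0, 0, 0, 0.5, 0, 1],
])
T_right = np.array([
    [0, 0.0, 0, 0, 0, 0, 0.0, 0, 0],
    [0, 0.0, 0, 0, 0, 0, 0.0, 0, 0],
    [0, 0.0, 1, 0, 0, 0, 0.0, 0, 0],
    [0, 0.0, 0, 1, 0, 0, 0.0, 0, 0],
    [0, 0.5, 0, 0, 1, 0, 0.0, 0, 0],
    [0, 0.5, 0, 0, 0, 1, 0.0, 0, 0],
    [1, 0.0, 0, 0, 0, 0, 0.0, 0, 0],
    [0, 0.0, 0, 0, 0, 0, 0.5, 1, 0],
    [0, 0.0, 0, 0, 0, 0, 0.5, 0, 1],
])
T = {"left": T_left, "right": T_right}

GAMMA = 0.9  # any value in (0, 1); 0.9 is an arbitrary illustrative choice

# Sanity check: each column sums to 1, so multiplying a probability vector
# over states by a transition matrix yields another probability vector.
for T_a in T.values():
    assert np.allclose(T_a.sum(axis=0), 1.0)
```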

Belief MDP definition

A Bayesian agent solves POMDPs by forming beliefs and acting upon them; those beliefs play the role of states in an MDP. Therefore, each POMDP can be converted to a particular MDP, called a belief MDP, whose states are defined in terms of beliefs about the original, hidden states. The belief MDP associated with our original POMDP is pictured below.

The belief MDP associated with our original POMDP. States (beliefs) are marked in orange. Note that the stochastic state transitions of our original POMDP are now replaced by deterministic belief transitions.


More formally, our belief MDP is defined by:

  1. The state set is $\mathcal{S}' = \Delta \mathcal{S}$, the space of probability distributions over the POMDP state set $\mathcal{S}$. We will represent these as probability vectors in $\mathbb{R}^9$, e.g. $b = [0.3,0,0,0.5,0,0.1,0,0,0.1]$.
  2. The action set remains the same.
  3. The state transition distribution is $\tau(b'\vert b,a) = \sum_{o\in \Omega}\sum_{s\in \mathcal{S}} b(s) \sum_{s'\in\mathcal{S}} T(s'\vert s,a) O(o\vert s')$, where the first sum is restricted to those $o\in\Omega$ such that $b'(s')= \frac{O(o\vert s')\sum_{s\in \mathcal{S}}b(s)T(s'\vert s,a)}{\sum_{s''\in \mathcal{S}}\left[O(o\vert s'')\sum_{s\in \mathcal{S}}b(s)T(s''\vert s,a)\right]}$, that is, $b'$ is the Bayesian posterior belief obtained from prior $b$ after taking action $a$ and observing $o$ (a code sketch of this update for our POMDP follows the list).
  4. The discount factor remains the same.
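
Because $\vert\Omega\vert = 1$, the observation likelihood $O(o\vert s')$ is the same constant for every state, so the Bayesian update above collapses to pushing the belief vector through the chosen transition matrix. A minimal sketch of that update, reusing `np` and the matrices from the earlier snippet (the function name and the explicit renormalization are our own choices):

```python
def belief_update(T_a, b):
    """One belief-MDP transition for this fully unobservable POMDP.

    T_a: column-stochastic transition matrix of the chosen action.
    b:   belief vector, i.e. a probability distribution over the 9 states.

    With a single possible observation, the factor O(o|s') in the general
    Bayesian update is constant, so the update reduces to b' = T_a @ b.
    """
    b_next = T_a @ b
    return b_next / b_next.sum()  # already normalized; division kept for clarity


# Starting from certainty in the star state, the belief trajectory under
# (left, left) is deterministic even though the hidden-state trajectory is not:
b0 = np.eye(9)[0]                 # belief concentrated on the star state
b1 = belief_update(T_left, b0)    # -> one-hot on s1
b2 = belief_update(T_left, b1)    # -> [0, 0, 0.5, 0.5, 0, 0, 0, 0, 0]
```

This is one way to see the caption's point above: the stochastic transitions of the POMDP become deterministic transitions between beliefs.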

Note that belief MDPs, being true MDPs, have no observations or observation model. Intuitively, this is because the agent has perfect knowledge of its belief state.

Belief visit distributions