A Method for Integrating Expert Knowledge When Learning Bayesian Networks From Data - 2011

Details

Title : A Method for Integrating Expert Knowledge When Learning Bayesian Networks From Data Author(s): Cano, A. and Masegosa, A. R. and Moral, S. Link(s) : http://ieeexplore.ieee.org/document/5872071/

Rough Notes

The concept of using expert knowledge is introduced as a way to tackle the bottlenecks of learning Bayesian Networks (BNs) e.g. needing many samples, only being able to learn the structure up to a class of graphs etc.
Cites (??, ????), where the expert is first asked about a total ordering of the variables, followed by their belief that each potential parent is a real parent. (I assume this can be visualized by drawing the variables according to the ordering, then all possible directed edges from this ordering, and each directed edge here has a probability of existing)
Mentions disadvantages of existing knowledge elicitation methods for BNs:
- Expert having to give probabilities about all edges.
- Expert can be biased to give the easiest edges, which could already be learnt without the expert.
- Expert is not given any information about what knowledge may be known based on using learning algorithms on existing data.
This approach instead puts the human in the loop, where the expert is asked about what they know during the learning process.
Notation:
- BN is over $n$ variables, each variable $X_i \in Val(X_i)$. The BN model itself consists of a DAG $G$ and a parameter vector $\Theta_G$ which defines a multinomial distribution $\mathbb{\theta}_{X_{i}| \mathbf{u}_i}$ for each $X_i$ and assignment $\mathbf{u}_{ij}$ with $j \in \{1,...,|Val(\mathbf{U}_i)}|\}$.
- $\mathcal{U}_i$ is the random variable taking values in the space of possible parent sets. E.g. if the ordering is $12...n$, $Val(\mathcal{U}_i) = \{ \mathbf{U}_i : \mathbf{U}_i \subset \{X_1,...,X_{i-1}\}\}$. $\mathbb{u}_i$ is a particular assignment to the parent set $Pa(X_i)$.
- $Val(\mathcal{G})$ is the set of all possible graph structures consistent with the total order,
Assumptions:
- Causal ordering is known.
- Hierarchical prior over all candidate BNs is $\mathbb{P}(\mathcal{B}) = \mathbb{P}(\Theta_G|G)\mathbb{P}(G)$. Here, $\Theta_G$ defines a multinomial distribution $\mathbf{\theta}_{X_i|\mathbf{u}_i}$ for each $X_i$, so it is given a Dirichlet prior.
- The prior over the structure $\mathbb{P}(\mathcal{G})$ satisfies the structure modularity property, i.e. $\mathbb{P}(\mathcal{G})=\prod_i \mathbb{P}_i(\mathcal{U}_i)$, each factor is a distribution over possible parent sets of $X_i$.
- Parameter independence (cites a PhD thesis in Spanish for this) and parameter modularity.
- Expert's answers are independent given the parent set.
Under the above assumptions, the Bayesian Dirichlet equivalent score allows efficient computation of the marginal likelihood of the data given $G$ - exact score formula is in Equation 2.
The marginal probability of a presence of an edge (which they call feature) $X_j \rightarrow X_i$ is $\mathbb{P}(X_j \rightarrow X_i|D) = \sum_{\mathbf{U} \in Val(\mathcal{U}_i)} I_j(\mathbf{U})\mathbb{P}(\mathbf{U}|D)$ where $I_j$ is the indicator function which is 1 if $X_j \in \mathbf{U}$. This summation however is exponential in $|Val(\mathcal{U}_i)|$, so the authors want to reduce parent set sizes somehow. They suggest an MCMC approach, getting samples from $\mathbb{P}(\mathcal{U}_i|D)$, however noting that poor mixing rates occur they describe an importance sampling based approach. They describe in Algorithm 1 the IS approach to sample a parent set for $X_i$ with probability $q$.
The prior over the graph structure (Equation 9) is $\mathbb{P}(Pa(X_i)) = \frac{B(r+\alpha, (i-1)-r+\alpha)}{B(\alpha,\alpha)}$, $\alpha$ being 0.5 gives a non-informative Jeffrey's prior and 1 leads to a minimum-description-length based prior. This prior is motivated in Section II.A.
Two approaches are shown to integrate expert knowledge:
- A posteriori integration of knowledge: Approximate the distribution $\mathbb{P}(\mathcal{U}_i|D)$, then query expert about the absence/presence of most uncertain edges.
  
  Here, they choose the edge that most reduce the entropy of $\mathbb{P}(\mathcal{U}_i|E,D)$, $E$ is a Bernoulli variable for the edge. The expert is shown this edge, maybe alongside the current posterior probability for this edge. The expert could inquire about another edge if they are unsure. The system updates it posterior probability $\mathbb{P}(\mathcal{U}_i|\mathbf{e},D)$. (I do not see $\mathbf{e} clearly defined, I assume it is the information from the expert). This process can be stopped for e.g. once a certain threshold is reached. The expert is asked about $\mathbb{P}(e|X_j \rightarrow X_i)$, which is assumed to be definitive i.e. 0 or 1. Authors mention the possibility of expert giving wrong answer with a fixed probability $\sigma$, where if this is 0.5, the authors say we are at the case where we have no expert knowledge, assuming the update equations
- Iterative integration of expert knowledge: Expert is queried about edges whenever there is uncertainty, and this information is used.
  
  Here, the system starts by sampling candidate parent sets $T'$ times, where \(T'<
#TODO Experiments section