1 Introduction: Mimic a Deep Reinforcement Learner

Deep Reinforcement Learning (DRL) has mastered human-level control policies in a wide variety of tasks [14]. Despite excellent performance, the learned knowledge remains implicit in neural networks and hard to explain: there is a trade-off between model performance and interpretability [11]. One framework for addressing this trade-off is mimic learning [1], which seeks a sweet spot between accuracy and efficiency by training an interpretable mimic model to match the predictions of a highly accurate model. Several works [2, 5, 7] have developed mimic learning methods that distill knowledge from deep models into mimic models with a tree representation, but only for supervised learning. DRL, however, is not a supervised process: agents continuously interact with an environment instead of learning from a static training/testing dataset.

This work develops a novel mimic learning framework for Reinforcement Learning. We examine two different approaches to generating data for RL mimic learning. In the first, the Experience Training setting, we record all state-action pairs during the training process of the DRL model and complement them with Q values as soft supervision labels; this allows traditional batch learning methods to be applied to train a mimic model. However, storing and reading the training experience of a DRL model consumes much time and space, and the training experience may not even be available to a mimic learner. Our second setting, Active Play, therefore generates streaming data by interacting with the environment using the mature DRL model. The active play setting requires an on-line algorithm that dynamically updates the model as more learning data is generated.

U-tree [13, 20] is a classic online reinforcement learning method which represents a Q function using a tree structure. To strengthen its generalization ability, we add a linear model to each leaf node, yielding a novel Linear Model U-Tree (LMUT). To support the active play setting, we introduce a novel on-line learning algorithm for LMUT, which applies Stochastic Gradient Descent to update the linear models, given some memory of recent input data stored on each leaf node. We conducted an empirical evaluation in three benchmark environments with five baseline methods. Two natural evaluation metrics for an RL mimic learner are (1) fidelity [7]: how well the mimic model matches the predictions of the neural net, as in supervised learning, and (2) play performance: how well the average return achieved by a controller based on the mimic model matches the return achieved by the neural net. Play performance is the most relevant metric for reinforcement learning. Perfect fidelity implies a perfect match in play performance. However, our experiments show that approximate fidelity does not imply a good match in play performance. This is because RL mimic learning must strike a balance between coverage (matching the neural net across a large section of the state space) and optimality (matching the neural net on the states that are most important for performance). In our experiments, LMUT learning achieves a good balance: the best match to play performance among the mimic methods, and competitive fidelity to the neural net predictions. The transparent tree structure of LMUT makes the DRL neural net interpretable. To analyze the mimicked knowledge, we compute the importance of input features and extract rules for typical examples of agent behavior. For image inputs, we highlight the super-pixels that mark the key regions.

Contributions. The main contributions of this paper are as follows: (1) To the best of our knowledge, this is the first work that extends interpretable mimic learning to Reinforcement Learning. (2) A novel on-line learning algorithm for LMUT, a new model tree designed to mimic a DRL model. (3) We show how to interpret a DRL model by analyzing the knowledge stored in the tree structure of LMUT.

The paper is organized as follows. Section 2 covers the background and related work on DRL, mimic learning and U-trees. Section 3 introduces the mimic learning framework and Sect. 4 shows how to learn an LMUT. Section 5 presents the empirical evaluation and Sect. 6 discusses the interpretability of LMUT.

2 Background and Related Work

Reinforcement Learning and the Q-function. Reinforcement Learning constructs a policy for an agent to interact with its environment and maximize cumulative reward [18]. Such an environment can be formalized as a Markov Decision Process (MDP) with a 4-tuple \((S, A, P, R)\), where at timestep t, an agent observes a state \(s_{t}\in S\), chooses an action \(a_{t}\in A\), and receives a reward \(r_{t}\in R\) and the next observation \(s_{t+1}\in S\) from the environment. A Q function represents the value of executing action \(a_{t}\) in state \(s_{t}\) [18]. Given a policy \(\pi \), the value is the expectation of the sum of discounted rewards \(Q_{t}(s_{t},a_{t}) = {{\,\mathrm{\mathbb {E}}\,}}_{\pi }(\sum _{k=0}^{\infty }\gamma ^{k} r_{t+k+1})\). Temporal difference learning updates the current Q-value estimates towards the observed reward and the estimated utility of the resulting state \(s_{t+1}\). A Deep Q-Network (DQN) [14] uses a neural network architecture to approximate the Q function for complex feature and action spaces. Parameter (\(\theta \)) updates minimize the differentiable loss function:

$$\begin{aligned} \mathcal {L}(\theta _{i})&\approx (r_{t}+\gamma \max _{a_{t+1}} Q(s_{t+1},a_{t+1}|\theta _{i}) - Q(s_{t},a_{t}|\theta _{i}))^{2}\end{aligned}$$
(1)
$$\begin{aligned} \theta _{i+1}&= \theta _{i}-\alpha \nabla _{\theta }\mathcal {L}(\theta _{i}) \end{aligned}$$
(2)
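To make Eqs. (1) and (2) concrete, the following minimal sketch (our own illustration, not the authors' implementation) performs one temporal-difference update for a simple linear Q-function approximator; the names theta, alpha and gamma and the toy dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
theta = rng.normal(scale=0.1, size=(n_actions, n_features))  # one weight row per action
alpha, gamma = 0.01, 0.99                                     # learning rate and discount factor

def q_values(state, theta):
    """Linear stand-in for the DQN: Q(s, a) = theta[a] . s."""
    return theta @ state

def td_update(theta, s_t, a_t, r_t, s_next):
    """One step of Eqs. (1)-(2): squared TD error, then gradient descent on it."""
    target = r_t + gamma * np.max(q_values(s_next, theta))    # r_t + gamma * max_a Q(s_{t+1}, a)
    td_error = target - q_values(s_t, theta)[a_t]
    # The gradient of the squared error w.r.t. theta[a_t] is -2 * td_error * s_t
    # (target held fixed), so descending the loss moves theta[a_t] along td_error * s_t.
    theta[a_t] += alpha * td_error * s_t
    return theta, td_error ** 2

s_t, s_next = rng.normal(size=n_features), rng.normal(size=n_features)
theta, loss = td_update(theta, s_t, a_t=1, r_t=1.0, s_next=s_next)
print(loss)
```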

Mimic Learning. Recent works on mimic learning [1, 5, 7] have demonstrated that models like shallow feed-forward neural networks or decision trees can mimic the function of a deep neural net. In the oracle framework, soft output labels are collected by passing inputs to a large, complex and accurate deep neural network [22]. A mimic model is then trained with the soft outputs as the supervision signal. The results indicate that training a mimic model with soft outputs achieves substantial improvements in accuracy and efficiency over training the same model type directly with the hard targets from the dataset.
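As a concrete (hypothetical) instance of the oracle framework, the sketch below labels a set of inputs with a teacher's soft outputs and fits a regression tree to them using scikit-learn; teacher_predict is a placeholder for the deep model.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_mimic(teacher_predict, inputs, max_leaf_nodes=64):
    """Oracle framework: query the deep model for soft labels,
    then fit an interpretable tree to those soft labels."""
    soft_labels = teacher_predict(inputs)            # soft supervision from the deep model
    mimic = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
    mimic.fit(inputs, soft_labels)
    return mimic

X = np.random.rand(1000, 4)                          # toy inputs
teacher_predict = lambda x: x @ np.array([1.0, -2.0, 0.5, 0.0])   # stand-in for a deep net
mimic = train_mimic(teacher_predict, X)
print(mimic.get_n_leaves())
```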

U-Tree Learning. A tree structure is transparent and interpretable, allowing rule extraction and measurement of feature influence [5]. U-tree [13] learning was developed as an online reinforcement learning algorithm with a tree structure representation. A U-tree takes a set of observed feature/action values as input and maps it to a state value (or Q-value). Uther and Veloso [20] introduce the Continuous U-Tree (CUT) for continuous state features. CUT learning generates a tree-based discretization of the input signal and estimates state transition probabilities in every leaf node [20]. Dynamic programming is applied to solve the resulting Markov Decision Process (MDP). Although CUTs have been successfully applied in test environments like Corridor and Hexagonal Soccer, constructing a CUT from raw data is slow and consumes significant computing time and space.

3 Mimic Learning for Deep Reinforcement Learning

Unlike supervised learning, a DRL model is not trained with static input/output data pairs; instead it interacts with the environment by selecting actions to perform and adjusting its policy to maximize the expectation of cumulative reward. We now present two settings to mimic the Q functions in DRL models.

Experience Training generates data for batch training, following [1, 5]. To construct a mimic dataset, we record all the observation signals \(I\) and actions \(a\) during the DRL process. A signal \(I\) is a vector of continuous features that represents a state (with a one-hot representation for discrete features). Then, by passing these pairs to a mature DRL model, we obtain the corresponding soft Q outputs and use the entire set of input/output pairs \(\{(\langle I_{1},a_{1}\rangle ,\hat{Q}_{1}(I_{1},a_{1})),(\langle I_{2},a_{2}\rangle ,\hat{Q}_{2}(I_{2},a_{2})),...,(\langle I_{T},a_{T}\rangle ,\hat{Q}_{T}(I_{T},a_{T}))\}\) as the experience training dataset.
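A minimal sketch of how such a dataset could be assembled, assuming a logged list of (signal, action) pairs and a mature_q function exposing the deep model's Q values (both hypothetical):

```python
def build_experience_dataset(recorded_pairs, mature_q):
    """Experience Training: label every recorded (signal, action) pair
    with the soft Q output of the mature DRL model."""
    dataset = []
    for I_t, a_t in recorded_pairs:
        q_hat = mature_q(I_t, a_t)                 # soft supervision label Q_hat(I_t, a_t)
        dataset.append(((I_t, a_t), q_hat))
    return dataset
```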

Fig. 1. Experience Training setting

Active Play generates mimic data by applying a mature DRL model to interact with the environment. Similar to [19], our active learner \(\ell \) has three components: \((q, f, I)\). The first component q is a querying function q(I) that, given the current observed signal \(I\), selects an action \(a\). The querying function controls \(\ell \)’s interaction with the environment, so it must balance exploration and exploitation. Our querying function is the \(\epsilon \)-greedy scheme [14] (with \(\epsilon \) decaying from 1 to 0). The second component f is the deep model that produces Q values: \(f:(I, a)\rightarrow {range}(\hat{Q})\).

As shown in Fig. 2, the mimic training data is generated in the following steps. Step 1: Given a starting observation signal \(I_{t}\) on time step t, we select an action \(a_{t}=q(I_{t})\) and obtain a soft output Q value \(\hat{Q}_{t} =f(I_{t}, a_{t}) \). Step 2: After performing \(a_{t}\), the environment provides a reward \(r_{t}\) and the next state observation \(I_{t+1}\). We record a labelled transition \(T_{t}=\{I_{t}, a_{t}, r_{t}, I_{t+1}, \hat{Q}_{t}(I_{t},a_{t}) \}\), where the soft label \( \hat{Q}_{t}(I_{t},a_{t})\) comes from the well-trained DRL model. A transition is the basic observation unit for U-tree learning. Step 3: We set \(I_{t+1}\) as the next starting observation signal and repeat the above steps until we have enough training data for the active learner \(\ell \) to finish sufficient updates over the mimic model \(M\). This process produces an infinite data stream (transitions \(T\)) in sequential order. We use minibatch online learning, where the learner returns a mimic model \(M\) after some fixed batch size B of queries.
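A sketch of these three steps, assuming the pre-0.26 Gym step interface (reset returning an observation; step returning observation, reward, done, info) and hypothetical deep_q and mimic objects:

```python
import random

def active_play(env, deep_q, mimic, n_actions, batch_size=100, total_steps=30000):
    """Stream labelled transitions from a mature DRL model and update the
    mimic model on minibatches of B = batch_size queries."""
    I_t = env.reset()
    batch = []
    for step in range(total_steps):
        eps = 1.0 - step / total_steps                  # epsilon decays from 1 to 0
        if random.random() < eps:                       # querying function q(I): epsilon-greedy
            a_t = random.randrange(n_actions)
        else:
            a_t = max(range(n_actions), key=lambda a: deep_q(I_t, a))
        q_hat = deep_q(I_t, a_t)                        # soft label from the mature model
        I_next, r_t, done, _ = env.step(a_t)
        batch.append((I_t, a_t, r_t, I_next, q_hat))    # labelled transition T_t
        if len(batch) == batch_size:
            mimic.update(batch)                         # minibatch online update of M
            batch = []
        I_t = env.reset() if done else I_next
    return mimic
```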

Compared to Experience Training, Active Play does not require recording data during the training process of DRL models. This is important for the following reasons. (1) Many mimic learners have access only to the trained deep models. (2) Training a DRL model often generates a large amount of data, which requires much memory and is computationally challenging to process. (3) The Experience Training data includes frequent visits to suboptimal states, which makes it difficult for the mimic learner to obtain an optimal return, as our evaluation illustrates.

Fig. 2. Active Play setting.

4 Learning Linear Model U-Trees

A neural network with continuous activation functions computes a continuous function. A regression tree can approximate a continuous function arbitrarily closely, given enough leaves. Continuous U-Trees (CUTs) are essentially regression trees for value functions, and therefore a natural choice for a tree-structured representation of a DRL Q function. However, their ability to generalize is limited, and CUT learning converges slowly. We introduce a novel extension of CUT, the Linear Model U-Tree (LMUT), which allows leaf nodes to contain a linear model rather than simple constants. Being strictly more expressive than a regression tree, a linear model tree can also approximate a continuous function arbitrarily closely, typically with many fewer leaves [4]. Smaller trees are more interpretable, and therefore more suitable for mimic learning.

As shown in Fig. 3 and Table 1, each leaf node of an LMUT defines a partition cell of the input space, which can be interpreted as a discrete state s for the decision process. Within each partition cell, an LMUT also records the reward \(r\) and the transition probabilities p of performing action a in the current state s, as shown in Leaf Node 5 of Fig. 3. Thus LMUT learning builds a Markov Decision Process (MDP) from the interaction data between the environment and the deep model. Compared to a linear Q-function approximator [18], an LMUT defines an ensemble of linear Q-functions, one for each partition cell. Since each Q-value prediction \(Q^{UT}_{N}\) comes from a single linear model, the prediction can be explained by the feature weights of that model.
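For concreteness, a leaf node (partition cell) could be represented as follows; the field names are our own and only sketch the quantities listed in Table 1.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class LeafNode:
    """A partition cell: a linear Q-model plus the MDP statistics of the cell."""
    weights: np.ndarray                                 # w_N1 ... w_NJ of the linear Q-model
    bias: float = 0.0                                   # w_N0
    transitions: list = field(default_factory=list)     # recent transitions assigned to this leaf
    reward: dict = field(default_factory=dict)          # action -> average reward r(s, a)
    next_counts: dict = field(default_factory=dict)     # (action, next leaf) -> count, for p(s'|s, a)

    def q_value(self, I):
        """Q^UT(I | w_N): linear model over the J state features."""
        return float(self.weights @ np.asarray(I, dtype=float) + self.bias)
```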

We now discuss how to train an LMUT. Similar to [20], we separate the training into two phases: (1) Data Gathering Phase and (2) Node Splitting Phase.

Fig. 3. An example of a Linear Model U-Tree (LMUT).

Table 1. Partition cell

4.1 Data Gathering Phase

Data Gathering Phase assigns transitions to leaf nodes and prepares them for fitting linear models and splitting nodes. Given an input transition \(T\), we pass it through feature splits down to a leaf node. As an option, an LMUT can dynamically build an MDP, in which case it updates transition probabilities, rewards and average Q values on the leaf nodes. The complete Data Gathering Phase process is detailed in part I (the first for loop) of Algorithm 1.
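A sketch of this routing step, assuming internal nodes carry a splitting feature index and threshold (hypothetical attribute names, consistent with the leaf sketch above):

```python
def assign_to_leaf(root, transition):
    """Data Gathering: pass a transition through the feature splits to its leaf."""
    I_t = transition[0]                          # the observation signal of the transition
    node = root
    while not node.is_leaf:                      # internal nodes hold a feature/threshold split
        if I_t[node.split_feature] <= node.split_threshold:
            node = node.left
        else:
            node = node.right
    node.transitions.append(transition)          # stored for SGD fitting and node splitting
    return node
```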

4.2 Node Splitting Phase

After node updating, LMUT scans the leaf nodes and updates their linear models with Stochastic Gradient Descent (SGD). If SGD achieves insufficient improvement on node N, LMUT determines a new split and subdivides the current partition cell into the resulting child leaves. For computational efficiency, our node splitting phase considers only a single split for each leaf given a single minibatch of new transitions. Part II of Algorithm 1 shows the details of the node splitting phase. LMUT applies a minibatch stagewise fitting approach to learn the linear models in the leaves of the U-tree. Like other stagewise approaches [10], this approach provides smoothed weight estimates where nearby leaves tend to have similar weights. We use Stochastic Gradient Descent to implement the weight updates.

Algorithm 1. LMUT learning (Part I: Data Gathering Phase; Part II: Node Splitting Phase).

Stochastic Gradient Descent (SGD) Weight Updates provide a straightforward, well-established online method for learning the weights of a single linear regression model. The weights and bias of the linear regression on leaf node N are updated by applying SGD over all transitions assigned to N. For a transition \(T_{t}= \langle I_{t}, a_{t}, r_{t}, {I}_{t+1}, \hat{Q}(I_{t},a_{t}) \rangle \), we take \(I_{t}\) as input and \(\hat{Q}_{t} \equiv \hat{Q}(I_{t},a_{t})\) as label. We build a separate LMUT for each action, so the linear model on N is a function of the J state features: \(Q^{UT}(I_{t}|w_{N},a_{t}) = \sum _{j=1}^{J}I_{tj}w_{Nj}+w_{N0}\). We update the weights \(w_N\) on leaf node N by applying SGD with the loss function \(\mathcal {L}(w_{N}) = \sum _{t}\frac{1}{2}(\hat{Q}_{t}-Q^{UT}(I_{t}|w_{N},a_{t}))^{2}\). The updates are computed with a single pass over each minibatch.
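A NumPy sketch of this update for a single leaf N, using the leaf layout sketched earlier; the step size lr is an assumption.

```python
import numpy as np

def sgd_update_leaf(leaf, lr=0.01):
    """One pass of SGD over the transitions assigned to leaf N, minimizing
    L(w_N) = sum_t 0.5 * (Q_hat_t - Q^UT(I_t | w_N))^2."""
    for I_t, a_t, r_t, I_next, q_hat in leaf.transitions:
        I_t = np.asarray(I_t, dtype=float)
        err = q_hat - (leaf.weights @ I_t + leaf.bias)   # Q_hat_t - Q^UT(I_t | w_N)
        leaf.weights += lr * err * I_t                   # gradient step on w_N1..w_NJ
        leaf.bias += lr * err                            # gradient step on w_N0
    return leaf
```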


The Splitting Criterion is used to find the best split on a leaf node when SGD achieves only limited improvement. We tried three splitting criteria: the working response of SGD, the Kolmogorov–Smirnov (KS) test, and the Variance test. The first method aims to find the split that best improves the working response of the parent linear model on the data for its children, but as reported in [10], the final result becomes less intelligible. The KS test is a non-parametric statistical test that measures the difference between the empirical cumulative distribution functions of the child datasets. The Variance criterion selects the split that generates child nodes whose Q values have the least variance; the idea is similar to the variance reduction method applied in CART trees. Like Uther and Veloso [20], we found that the Variance test works well and has lower time complexity than the KS test (\(O(n)\) vs. \(O(n^{2})\)), so we select the Variance test as the splitting criterion. Exploring the different possible splits efficiently is the main scalability challenge in LMUT learning (cf. [20]).
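A simplified (non-incremental) sketch of the Variance splitting criterion: score every candidate (feature, threshold) pair by the reduction in Q-value variance and return the best one.

```python
import numpy as np

def best_variance_split(inputs, q_values):
    """inputs: (n, J) array of observation signals assigned to a leaf;
    q_values: (n,) array of their soft Q labels."""
    n, J = inputs.shape
    parent_var = np.var(q_values)
    best = (None, None, 0.0)                              # (feature, threshold, variance reduction)
    for j in range(J):
        for threshold in np.unique(inputs[:, j])[:-1]:    # candidate thresholds for feature j
            left = q_values[inputs[:, j] <= threshold]
            right = q_values[inputs[:, j] > threshold]
            child_var = (len(left) * np.var(left) + len(right) * np.var(right)) / n
            if parent_var - child_var > best[2]:
                best = (j, threshold, parent_var - child_var)
    return best
```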

5 Empirical Evaluation

We evaluate the mimic performance of LMUT by comparing it with five baseline methods in three evaluation environments. The empirical evaluation measures how well the mimic matches the DQN in regression and in game playing, under both the Experience Training and Active Play settings.

5.1 Evaluation Environment

The evaluation environments are Mountain Car, Cart Pole and Flappy Bird. Our environments are simulated by the OpenAI Gym toolkit [3]. Mountain Car and Cart Pole are two benchmark tasks for reinforcement learning [17]: Mountain Car is about accelerating a car to the top of a hill, and Cart Pole is about balancing a pole in the upright position. Both have a discrete action space and a continuous feature space. Flappy Bird is a mobile game in which the player controls a bird flying between pipes. Flappy Bird has two discrete actions, and its observation consists of four consecutive images [14]. We follow the Deep Q-Learning (DQN) method to play this game. During image preprocessing, the input images are first rescaled to 80*80, converted to grayscale and then binarized. With 6,400 features, the state space of Flappy Bird is substantially more complex than that of Cart Pole and Mountain Car.
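The image preprocessing described above can be sketched with OpenCV as follows; the binarization threshold is an assumption.

```python
import cv2
import numpy as np

def preprocess(frame, size=80, threshold=1):
    """Rescale a raw frame to 80*80, convert it to grayscale, then binarize it."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (size, size))
    return (small >= threshold).astype(np.float32).flatten()   # 80*80 = 6,400 binary features

def stack_state(last_four_frames):
    """A Flappy Bird observation is built from four consecutive preprocessed images."""
    return np.concatenate([preprocess(f) for f in last_four_frames])
```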

5.2 Baseline Methods

Batch Methods. We fit the input/output training pairs \((\langle I,a\rangle ,\hat{Q}(I,a))\) using batch tree learners. A CART regression tree [12] predicts, for each leaf node, the mean Q-value over the samples assigned to that leaf. M5 [15] is a tree training algorithm with stronger generalization ability: it first constructs a piecewise constant tree and then prunes it to build a linear regression model for the instances in each leaf node. The WEKA toolkit [8] provides an implementation of M5. We include M5 with the Regression-Tree option (M5-RT) and M5 with the Model-Tree option (M5-MT) in our baselines. M5-MT builds a linear function on each leaf node, while M5-RT has only a constant value.

On-line Learning Methods. We apply the recent Fast Incremental Model Tree (FIMT) [9] method. Similar to M5-MT, it builds a linear model tree, but it can perform explicit change detection and informed adaptation for evolving data streams. We experiment with a basic version of FIMT and an advanced version with Adaptive Filters on leaf nodes (named FIMT-AF).

5.3 Fidelity: Regression Performance

We evaluate how well our LMUT approximates the soft output (\(\hat{Q}\) values) of the Q function in a Deep Q-Network (DQN). We report the standard regression metrics Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).

Under the Experience Training setting, we compare the performance of CART, M5-RT, M5-MT, FIMT and FIMT-AF with our LMUT. The dataset sizes are 150 K transitions for Mountain Car, 70 K transitions for Cart Pole, and 20 K transitions for Flappy Bird. Because of the high dimensionality of the Flappy Bird state space, storing 20 K transitions requires 32 GB of main memory. Given an experience training dataset, we apply 10-fold cross-validation to train and test our model.

For the Active Play setting, batch training algorithms like CART and M5 are not applicable, so we experiment only with the online methods FIMT, FIMT-AF and LMUT. We first train the mimic models with 30 K consecutive transitions from the evaluation environments, and evaluate them with another 10 K transitions. The results for the three evaluation environments are shown in Tables 2, 3 and 4. The differences between the results of LMUT and those of the other models are statistically significant (t-test; \(p <5\%\)).

Compared to the other two online learning methods (FIMT and FIMT-AF), LMUT achieves a better fit to the neural net predictions with a much smaller model tree, especially in the active play online setting. This is because both FIMT and FIMT-AF update their model trees continuously after each datum, whereas LMUT fits minibatches of data at each leaf. Neither FIMT nor FIMT-AF terminates on high-dimensional data, so we omit their results in the Flappy Bird environment. Compared to the batch methods, the CART tree model has significantly more leaves than our LMUT, but no better fit to the DQN than M5-RT, M5-MT and LMUT, which suggests overfitting. In the Mountain Car and Flappy Bird environments, the batch model tree learners (M5-RT and M5-MT) perform better than LMUT, while LMUT still achieves comparable fidelity and leads in the Cart Pole environment. In conclusion, (1) our LMUT learning algorithm outperforms the state-of-the-art online model tree learner FIMT, and (2) although LMUT is an online learning method, it shows performance competitive with the batch methods even in the batch setting.

Table 2. Result of Mountain Car
Table 3. Result of Cart Pole
Table 4. Result of Flappy Bird
Fig. 4. Coverage vs. optimality

Learning Curves. We apply consecutive testing [9] to analyze the performance of LMUT learning in more detail. We compute the correlation and testing error of LMUT as more transitions are provided for learning (from 0 to 30 K) under the active play setting. To make the error scales comparable across the game environments, we use Relative Absolute Error (RAE) and Relative Square Error (RSE). We repeat the experiment 10 times and plot the resulting curves in Fig. 5. In the Mountain Car environment, LMUT converges quickly, with its performance increasing smoothly within 5 K transitions. For more complex environments like Cart Pole and Flappy Bird, the evaluation metrics fluctuate during the learning process but approach the optimum within 30 K transitions.

5.4 Matching Game Playing Performance

We now evaluate how well a mimic model matches the DQN by directly playing the games with it and computing the average reward per episode. (The games in the OpenAI Gym toolkit are divided into episodes that start when a game begins and terminate when (1) the player reaches the goal, (2) the player fails a fixed number of times, or (3) the game time passes a preset threshold.) Specifically, given an input signal \(I_{t}\), we obtain Q values from the mimic model and select an action \(a_{t}=\arg \max _{a}Q(I_{t},a)\). By executing \(a_{t}\) in the current game environment, we receive a reward \(r_{t}\) and the next observation signal \(I_{t+1}\). This process is repeated until a game episode terminates. This experiment uses Average Reward Per Episode (ARPE), a common evaluation metric applied by both DRL models [14] and the OpenAI Gym toolkit [3], to evaluate the mimic models. In the Experience Training setting, the play performance of CART, M5-RT, M5-MT, FIMT, FIMT-AF and our LMUT is evaluated and compared by partial 10-fold cross-validation, where we select 9 sections of data to train the mimic models and test them by directly playing another 100 games. For Active Play, LMUT is compared only with the online methods FIMT and FIMT-AF, excluding the Flappy Bird environment (as discussed in Sect. 5.3). Here we train the mimic models with 30 K transitions and test them on another 100 games.
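A sketch of this evaluation loop under the pre-0.26 Gym step interface; mimic_q(I, a) is a hypothetical handle to the mimic model's Q estimate.

```python
def average_reward_per_episode(env, mimic_q, n_actions, n_episodes=100):
    """Play n_episodes with the mimic model's greedy policy and return the ARPE."""
    total = 0.0
    for _ in range(n_episodes):
        I_t, done = env.reset(), False
        while not done:
            a_t = max(range(n_actions), key=lambda a: mimic_q(I_t, a))   # greedy action
            I_t, r_t, done, _ = env.step(a_t)
            total += r_t
    return total / n_episodes
```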

Fig. 5. Consecutive testing of LMUT

Table 5 shows the results for game playing performance. We first experiment with learning a Continuous U-Tree (CUT) directly using reinforcement learning [20] instead of mimic learning. CUT converges slowly with limited performance, especially in the high-dimensional Flappy Bird environment. This shows the difficulty of directly constructing a tree model from the environment.

We find that among all mimic methods, LMUT achieves the game playing performance (ARPE) closest to that of the DQN. Although the batch learning models have strong fidelity in regression, they do not perform as well in game playing as the DQN. Game playing observation shows that the batch learning models (CART, M5-RT, M5-MT) are likely to choose sub-optimal actions in some key scenarios (e.g., when a pole tilts to one side with high velocity in Cart Pole). This is because the neural net controller selects many sub-optimal actions at the beginning of training, so the early training experience contains many sub-optimal state-action pairs. The batch models fit the entire training experience equally, while our LMUT fits the most recently generated transitions from a mature controller more closely. More recent transitions tend to correspond to optimal actions. The FIMT algorithms keep adapting to the most recent input only, and fail to build adequate linear models on their leaf nodes. Compared to them, LMUT achieves a sweet spot between optimality and coverage (Fig. 4).

Table 5. Game playing performance

6 Interpretability

We discuss how to interpret a DRL model by analyzing the knowledge stored in the transparent tree structure of LMUT: computing feature influence, analyzing extracted rules and highlighting super-pixels.

6.1 Feature Influence

Feature importance is one of the most common interpretation tools for tree-based models [5, 21]. In an LMUT model, splitting thresholds define partition cells for input signals. We evaluate the influence of a splitting feature by the total variance reduction of the Q values. Recall that our LMUT learning algorithm estimates a weight \(w_{Nf}\) for every node N and feature f in the model tree (Algorithm 1 Part II). The magnitude of these node weights provides extra knowledge about feature importance. To measure the influence of f on N, we weight the expected variance reduction from splitting N on f by one plus f's standardized squared weight:

$$\begin{aligned} {Inf}_{f}^{N} =(1+\frac{w_{Nf}^{2}}{\sum _{j=1}^{J}w_{Nj}^{2}})(var_{N} -\sum _{c=1}^{C}\frac{Num_{c}}{\sum _{i=1}^{C}Num_{i}}var_{c}), \end{aligned}$$
(3)

where \(Num_{c}\) is the number of training instances assigned to child node c and \(var_{N}\) is the variance of the Q values at node N. The total influence of a feature \( {Inf}_{f}\) is the sum of \( {Inf}_{f}^{N}\) over all nodes N split on f in our LMUT. For Mountain Car and Cart Pole, we report the feature influences in Table 6. The most important features for Mountain Car and Cart Pole are Velocity and Pole Angle respectively, which matches the common understanding of these domains. For Flappy Bird the observations are 80*80 pixel images, so LMUT uses pixels as splitting features. Figure 6 illustrates the pixels with feature influences \( {Inf}_{f}>0.008\) (the mean of all feature influences). Because locating the bird is very important, the most influential pixels are located on the top left, where the bird is likely to be.
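A sketch of Eq. (3) and of the total influence computation; it assumes each recorded split provides the splitting feature f, the node weights w_N, and the Q values at the node and at its children (hypothetical names).

```python
import numpy as np
from collections import defaultdict

def node_influence(w_N, f, q_node, q_children):
    """Inf_f^N from Eq. (3): (1 + standardized squared weight of f) times the
    variance reduction obtained by splitting node N on feature f."""
    weight_term = 1.0 + w_N[f] ** 2 / np.sum(w_N ** 2)
    n_total = sum(len(q_c) for q_c in q_children)
    child_var = sum(len(q_c) / n_total * np.var(q_c) for q_c in q_children)
    return weight_term * (np.var(q_node) - child_var)

def total_feature_influence(splits):
    """Sum Inf_f^N over all nodes N split on feature f."""
    influence = defaultdict(float)
    for f, w_N, q_node, q_children in splits:
        influence[f] += node_influence(np.asarray(w_N, dtype=float), f, q_node, q_children)
    return dict(influence)
```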

Table 6. Feature influence
Fig. 6. Super-pixels in Flappy Bird

6.2 Rule Extraction

Rule extraction is a common method to distill knowledge from tree models [2, 7]. We extract and analyze rules for the Mountain Car and Cart Pole environments. Figure 7 (top) shows three typical examples of extracted rules in the Mountain Car environment. The rules are presented in the form of partition cells (splitting intervals in the LMUT). Each cell contains the range of velocity, the range of position and a Q vector (\(\mathbf {Q}=\langle Q_{move \_left}, Q_{no\_ push}, Q_{move\_ right}\rangle \)) representing the average Q-values in the cell. The top left example is a state where the car is moving toward the left hill with very small velocity. The extracted rule suggests pushing right (\(Q_{move\_ right}\) has the largest value, −29.4): the car has almost stopped on the left, and by pushing right it can increase its momentum. The top middle example illustrates a state where the car is approaching the top of the left hill with larger leftward velocity (compared to the first example). In this case, however, the car should be pushed left (\(Q_{move\_ left}=-25.2\) is the largest), in order to store more gravitational potential energy and prepare for the final rush to the target. The rush leads to the state shown in the top right image, where the car should be pushed right to reach the target. Notice that the fewer steps required to reach the target from a given state, the larger its Q-values.

Fig. 7. Examples of rule extraction for Mountain Car and Cart Pole.

Figure 7 (bottom) shows three examples of extracted rules in the Cart Pole environment. Each cell contains the ranges of cart position, cart velocity, pole angle and pole velocity, and a Q vector (\(\mathbf {Q}=\langle Q_{push \_left}, Q_{push\_ right}\rangle \)). The key to Cart Pole is using inertia and acceleration to balance the pole. The bottom left example illustrates the tree rule that the cart should be pushed right (i.e., \(Q_{push\_ right}>Q_{push \_left}\)) if the pole tilts to the right with a velocity less than 0.5. A similar scenario is the second example, where the pole is also tilting to the right but has velocity towards the left. The Q-values correctly indicate that we should push right to maintain this trend, even though the cart is close to the right-side border, which makes its Q values smaller than in the first example. The third example describes a case where the pole tilts to the left with velocity towards the right. The model correctly selects a left push to achieve a left acceleration.

6.3 Super-Pixel Explanation

In video games, DRL models take the raw pixels from four consecutive images as input. To mimic the deep models, our LMUT also learns on four consecutive images and performs splits directly on raw pixels. Deep models for image input can be explained by super-pixels [16]. We highlight the pixels that have feature influence \( {Inf}_{f}>0.008\) (the mean of all feature influences) along the splitting path from the root to the target partition cell. Figure 8 provides two examples of input images with their highlighted pixels, at the beginning of a game (top) and in the middle of a game (bottom). Most splits are made on the first image, which reflects the importance of the most recent input. The first image is often used to locate the pipes (obstacles) and the bird, while the remaining three images provide further information about the bird's location and velocity.

Fig. 8. Flappy Bird input images with super-pixels (marked with red stars). The four consecutive input images are ordered from left to right.

7 Conclusion

This work introduced a mimic learning framework for Reinforcement Learning environments. A novel Linear Model U-Tree represents an interpretable model with the expressive power to approximate a Q-value function learned by a deep neural net. We introduced a novel on-line LMUT mimic learning algorithm based on stochastic gradient descent. The empirical evaluation compared LMUT with five baseline methods on three different Reinforcement Learning environments. The LMUT model achieved clearly the best match to the neural network's performance on the RL task. We illustrated the ability of LMUT to extract the knowledge implicit in the neural network model by (1) computing the influence of features, (2) analyzing the extracted rules and (3) highlighting the super-pixels. A direction for future work is to explore variants of our LMUT, for example by adding tree pruning and by experimenting more extensively with hyper-parameters. Another important topic is sampling strategies for the active play setting, which would illuminate the difference we observed between matching the neural net's play performance and matching the function it represents.