Visualising State Space Representations of LSTM Networks

Emmanuel M. Smith¹, Jim Smith¹, Phil Legg¹, Simon Francis²

¹ Department of Computer Science and Creative Technologies
University of the West of England, Bristol, United Kingdom

² Montvieux Limited, Gloucestershire, United Kingdom

Long Short-Term Memory (LSTM) networks have proven to be one of the most effective models for making predictions on sequence-based tasks [1, 2]. These models work by capturing, remembering, and forgetting information relevant to their future predictions. The non-linear complexity of the mechanisms involved in this process means we currently lack tools for achieving interpretability [3-5]. Ideally, we want these models to provide an explanation of why they make a particular prediction, given a specific input. Researchers have explored the idea of interpreting LSTMs in specific contexts such as Natural Language Processing (NLP) or classification [6-8], but they put minimal focus on approaches which are generalisable across different applications. To alleviate this, in this work, we demonstrate a method which enables the interpretation and comparison of LSTM states during time series predictions. We show that by reducing the dimensionality of network states one can scalably visualise patterns and explain model behaviours.

In this notebook we first train a small LSTM network on a basic square wave prediction task and allude to the complexities which arise whilst attempting to identify patterns in the states of LSTMs. We then train a larger LSTM network on the same problem to demonstrate how this problem is exacerbated by increasing the amount of nodes in a network. Next, we discuss the novel concept of using state space representations which allow for the visualisation and comparison of network states in a scalable manner. We first apply this technique towards the initial square wave problem to demonstrate the improvements it offers over examining the unmodified states. After this, we apply the same approach to a more complex signal which involves alternating wave types. Finally, we utilise the visualisation technique on a real-world dataset which involves water usage readings from a university campus.

Basic Examples of LSTM Interpretability¶

The entire purpose of developing new models is to obtain more accurate and powerful tools for predicting signals. Whether those signals are a stream of words, a sensor reading, or a constructed signal, our model must first be able to predict accurately before we can even concern ourselves with the interpretation of those predictions. To demonstrate how one might go about naively interpreting an LSTM model, we first consider a simplified scenario. The scenario involves predicting a standard square wave. Each period contains 16 samples. The model is presented with each sample individually, as input, and it has to predict what the value will be for the next sample in the series.

# Allow for the collapse of the code cells.
from IPython.display import HTML
HTML('''<script>code_show=true;function code_toggle(){if(code_show){$('div.input').hide();}else{$('div.input').show();}code_show = !code_show;} $(document).ready(code_toggle);</script><form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

import copy
import datetime
import numpy
import pandas
import pickle
import plotly.graph_objs as graphs
import plotly.offline as plotly
import random
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
import torch
import torch.autograd as autograd
import torch.nn as nn


class WaveLSTM(nn.Module):
    '''Allow for dynamically sizing the LSTM sizes and state extraction..'''
    def __init__(self, features=1, layers=1, nodes=2):
        super().__init__()
        self.features = features
        self.nodes = nodes
        self.hidden_layers = nn.ModuleList([nn.LSTM(self.features, self.nodes)])
        for _ in range(layers - 1):
            self.hidden_layers.append(nn.LSTM(self.nodes, self.nodes))
        self.output_layer = nn.Linear(self.nodes, self.features)
        
        self.states = [None for _ in range(layers)]
        self.reset_state()
            

    def reset_state(self):
        for l in range(len(self.hidden_layers)):
            self.states[l] =  (autograd.Variable(torch.zeros(1, 1, self.nodes)),
                               autograd.Variable(torch.zeros(1, 1, self.nodes)))
    
    def forward(self, input_):
        output, self.states[0] = self.hidden_layers[0](input_.view(1, 1, -1), self.states[0])
        for l in range(1, len(self.hidden_layers)):
            output, self.states[l] = self.hidden_layers[l](output, self.states[l])
        return self.output_layer(output.view(1, -1))


def set_random_seeds(seed=0):
    '''Ensure reproducability.'''
    random.seed(seed)
    numpy.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    
def make_wave(values, tiles=1, offset=None):
    '''Generate repeating wave pattern from given values.'''
    # Randomly generate phase offset if not given.
    if offset is None: 
        length = len(values)
        offset = int(get_random_normal(length//2, length//2, 0, length-1))
    
    # Repeat wave pattern and apply offset.
    values = numpy.tile(values, tiles)
    values = numpy.roll(values, offset, axis=0)
    
    return numpy.arange(len(values)), values


def normalise_wave(wave):
    '''Normalise values for optimal neural network inputs.'''
    values = wave[1]
    if values.ndim == 1:
        values = values.reshape((-1, 1))
        values = normalize(values, norm='max', axis=0)
        values = values.reshape((-1))
    else:
        values = normalize(values, norm='max', axis=0)
    return wave[0], values * 2 - 1


def make_sine(steps=10, tiles=1, offset=0, frequency=1):
    '''Generate a repreating sine wave pattern.'''
    times = numpy.linspace(0, tiles, int(steps*tiles))
    values = numpy.sin(2 * numpy.pi * frequency * times)
    return make_wave(values, 1, offset)


def make_square(steps=10, tiles=1, offset=None):
    '''Generate a repreating square wave pattern.'''
    values = numpy.concatenate((numpy.zeros(steps//2), numpy.ones(steps//2)))
    return make_wave(values, tiles, offset)


def make_saw(steps=10, tiles=1, offset=None):
    '''Generate a repreating saw wave pattern.'''
    values = numpy.arange(steps)
    return make_wave(values, tiles, offset)


def make_triangle(steps=10, tiles=1, offset=None):
    '''Generate a repreating triangle wave pattern.'''
    values = numpy.concatenate((numpy.arange((steps//2) + 1), numpy.arange(steps//2)[:0:-1]))
    return make_wave(values, tiles, offset)


def combine_waves(a, b, tiles=1, offset=None, normalise=True):
    '''Combine two waves to be presented as input features.'''
    # Normalise waves to same scale and range.
    if normalise:
        a, b = normalise_wave(a), normalise_wave(b)
        
    min_length = min(len(a[1]), len(b[1]))
    values = numpy.column_stack((a[1][:min_length], b[1][:min_length]))
    return make_wave(values, tiles=(tiles, 1), offset=offset)


def attach_waves(a, b, tiles=1, offset=None, normalise=True):
    '''Attach one wave to the tail end of another.'''
    # Normalise waves to same scale and range.
    if normalise:
        a, b = normalise_wave(a), normalise_wave(b)
        
    values = numpy.concatenate((a[1], b[1]))
    return make_wave(values, tiles=tiles, offset=offset)
    

def make_traces_2d(trace_data):
    '''Generate data traces for a 2D scatter graph.'''
    traces = []
    for trace in trace_data:
        traces.append(graphs.Scatter(**trace))
    return traces


def make_figure_2d(traces, name, x_axis, y_axis, annotations=[]):
    '''Generate plotable figure of a 2D scatter graph.'''
    layout = graphs.Layout(
        title=name,
        xaxis=graphs.XAxis(title=x_axis, rangeslider=dict()),
        yaxis=graphs.YAxis(title=y_axis),
        annotations=annotations,
    )
    return graphs.Figure(data=traces, layout=layout)


def extract_activations(states, layer=None, state_type='hidden'):
    '''Extract the internal activations from the recorded states of an LSTM model.'''
    if state_type: 
        state_type = {'hidden': 0, 'cell': 1}[state_type]
    
    activations = []
    for state in states:
        if layer:
            activation = state[layer][state_type].data.numpy().squeeze()
        else:
            activation = state[0][state_type].data.numpy().squeeze()
            activation = numpy.concatenate((activation, state[1][state_type].data.numpy().squeeze()))
        activations.append(activation)
        
    return numpy.array(activations)


def get_random_normal(mean, deviation, minimum=numpy.nan, maximum=numpy.nan):
    '''Generate a bounded random number from a normal distribution.'''
    result = numpy.random.normal(mean, deviation)
    result = min(result, maximum)
    result = max(result, minimum)
    return result



def make_predictions(data, seed_amount, wave_length):
    model.reset_state()
    outputs = []
    states = [copy.copy(model.states)]
    inputs = autograd.Variable(torch.Tensor(data[1][:seed_amount]))
    for input_ in inputs:
        output = model(input_.view(1, 1, -1))
        states.append(copy.copy(model.states))
        outputs.append(output)
    for _ in range(len(data[1]) - seed_amount):
        output = model(output)
        states.append(copy.copy(model.states))
        outputs.append(output)
    return outputs, states

               
def make_prediction_plot(data, outputs, wave_length, seed_amount, title, annotations=[], show_prediction=True):
    traces = make_traces_2d([
        dict(
            name='Presentation (Target)', 
            x=data[0][:wave_length+1],
            y=data[1][:wave_length+1],
            mode='lines'),
        dict(
            name='Seeding (Target)', 
            x=data[0][wave_length:seed_amount+1],
            y=data[1][wave_length:seed_amount+1]),
        dict(
            name = 'Feedback (Target)',
            line = dict(dash='dash'),
            x = data[0][seed_amount:],
            y = data[1][seed_amount:],
        ),
        dict(
            name = 'Prediction (Output)',
            x = data[0][1:],
            y = numpy.array(list(map(lambda x: x.data.numpy(), outputs))).squeeze(),
            visible=True if show_prediction else 'legendonly',
            opacity=0.8,
            line=dict(width=2),
        ),
    ])
    figure = make_figure_2d(traces, title, 'Timestep', 'Value', annotations=annotations)
    return plotly.iplot(figure)


def get_node_activations(states, seed_weighting):
    activations = {
        'network_hidden': extract_activations(states, state_type='hidden'),
        'network_cell': extract_activations(states, state_type='cell'),
        'layer-0_hidden': extract_activations(states, layer=0, state_type='hidden'),
        'layer-0_cell': extract_activations(states, layer=0, state_type='cell'),
        'layer-1_hidden': extract_activations(states, layer=1, state_type='hidden'),
        'layer-1_cell': extract_activations(states, layer=1, state_type='cell'),
    }

    activations_reduced = {}
    for (name, values) in activations.items():
        reducer = PCA(n_components=2)
        activations_reduced[name] = reducer.fit_transform(values[:int(seed_weighting*len(values))])  # Normalisation affects visualisations negatively.
        activations_reduced[name] = numpy.append(activations_reduced[name], reducer.transform(values[int(seed_weighting*len(values)):]), 0)
    node_activations = []
    for node in range(activations['network_cell'].shape[1]):
        node_activations.append(activations['network_cell'][:, node])
    return activations, activations_reduced, node_activations


def make_state_activation_plot(data, node_activations, title, annotations=[], color=-1):
    node_traces = []
    for i, node_activation in enumerate(node_activations):
        scatter = graphs.Scattergl(
            name = 'Node {}'.format(i),
            x = data[0],
            y = node_activation,
            mode = 'lines',
            line = dict(color='rgba(1, 1, 1, 0.33)' if i != color else 'rgba(1, 1, 1, 0.90)', width=1),
        )
        node_traces.append(scatter)
    layout = graphs.Layout(
        title=title,
        xaxis=graphs.XAxis(title='Timestep'),
        yaxis=graphs.YAxis(title='Activation'),
        hovermode='closest',
        annotations=annotations,
    )
    figure = graphs.Figure(data=node_traces, layout=layout)
    return plotly.iplot(figure)


def make_reduced_state_activation_plot_2D_by_position(data, activations_reduced, wave_length, title, annotations=[]):
    trace_0 = graphs.Scatter(
        name = 'PCA Component 0',
        x = data[0],
        y = activations_reduced['network_cell'][:,0],
        mode = 'lines+markers',
        marker = dict(
            size=4,
            color = numpy.tile(numpy.arange(wave_length), 80),
            colorscale = 'Viridis',
        ),
        line = dict(color='rgba(1, 1, 1, 0.5)', width=1)
    )
    trace_1 = graphs.Scatter(
        name = 'PCA Component 1',
        x = data[0],
        y = activations_reduced['network_cell'][:,1],
        mode = 'lines+markers',
        marker = dict(
            size=4,
            color = numpy.tile(numpy.arange(wave_length), 80),
            colorscale = 'Viridis',
            showscale=True,
            colorbar=graphs.ColorBar(title='Wave Position', ypad=90),
        ),
        line = dict(dash='dot', color='rgba(1, 1, 1, 0.5)', width=1),
    )
    layout = graphs.Layout(
        title=title,
        xaxis=graphs.XAxis(title='Timestep', rangeslider=dict()),
        yaxis=graphs.YAxis(title='Activation'),
        annotations=annotations,
    )
    figure = graphs.Figure(data=[trace_0, trace_1], layout=layout)
    return plotly.iplot(figure)


def make_reduced_state_activation_plot_2D_by_value(data, outputs, activations_reduced, wave_length, title, annotations=[]):
    trace_0 = graphs.Scatter(
        name = 'PCA Component 0',
        x = data[0],
        y = activations_reduced['network_cell'][:,0],
        mode = 'lines+markers',
        marker = dict(
            size=4,
            color = numpy.concatenate((numpy.array(outputs[0][0][0].data.numpy()), numpy.array(list(map(lambda x: x[0][0].data.numpy(), outputs))).squeeze()), axis=0),
            colorscale = 'Viridis',
        ),
        line = dict(color='rgba(1, 1, 1, 0.5)', width=1)
    )
    trace_1 = graphs.Scatter(
        name = 'PCA Component 1',
        x = data[0],
        y = activations_reduced['network_cell'][:,1],
        mode = 'lines+markers',
        marker = dict(
            size=4,
            color = numpy.concatenate((numpy.array(outputs[0][0][0].data.numpy()), numpy.array(list(map(lambda x: x[0][0].data.numpy(), outputs))).squeeze()), axis=0),
            colorscale = 'Viridis',
            showscale=True,
            colorbar=graphs.ColorBar(title='Value', ypad=90),
        ),
        line = dict(dash='dot', color='rgba(1, 1, 1, 0.5)', width=1),
    )
    layout = graphs.Layout(
        title=title,
        xaxis=graphs.XAxis(title='Timestep', rangeslider=dict()),
        yaxis=graphs.YAxis(title='Activation'),
        annotations=annotations,
    )
    figure = graphs.Figure(data=[trace_0, trace_1], layout=layout)
    return plotly.iplot(figure)


def make_state_space_plot_2D_by_position(data, activations_reduced, wave_length, seed_weighting, title, annotations):
    trace_2D = graphs.Scatter(
        name='Presentation',
        x = activations_reduced['network_cell'][:,0][:wave_length],
        y = activations_reduced['network_cell'][:,1][:wave_length],
        text = list(map(lambda t: 'Timestep: ' + str(t), data[0][:wave_length])),
        mode = 'lines+markers',
        marker = dict(
            size='8',
            color = numpy.tile(numpy.arange(wave_length), 80)[:wave_length],
            colorscale='Viridis',
            showscale=True,
            colorbar=graphs.ColorBar(title='Wave Position', ypad=75),
        ),
        line = dict(
            color = 'rgba(1, 1, 1, 0.15)',
            width=1,
        ),
        visible='legendonly'
    )
    trace_2D1 = graphs.Scatter(
        name='Seeding',
        x = activations_reduced['network_cell'][:,0][wave_length:int(seed_weighting*len(data[0]))],
        y = activations_reduced['network_cell'][:,1][wave_length:int(seed_weighting*len(data[0]))],
        text = list(map(lambda t: 'Timestep: ' + str(t), data[0][wave_length:int(seed_weighting*len(data[0]))])),
        mode = 'lines+markers',
        marker = dict(
            size='8',
            color = numpy.tile(numpy.arange(wave_length), 80)[wave_length:int(seed_weighting*len(data[0]))],
            colorscale='Viridis',
            showscale=True,
            colorbar=graphs.ColorBar(title='Wave Position', ypad=75),
        ),
        line = dict(
            color = 'rgba(1, 1, 1, 0.15)',
            width=1,
        ),
    )
    trace_2D2 = graphs.Scatter(
        name = 'Feedback',
        x = activations_reduced['network_cell'][:,0][int(seed_weighting*len(data[0])):],
        y = activations_reduced['network_cell'][:,1][int(seed_weighting*len(data[0])):],
        text = list(map(lambda t: 'Timestep: ' + str(t), data[0][int(seed_weighting*len(data[0])):])),
        mode = 'lines+markers',
        marker = dict(
            size='8',
            color = numpy.tile(numpy.arange(wave_length), 80)[int(seed_weighting*len(data[0])):],
            colorscale='Viridis',
            showscale=True,
            colorbar=graphs.ColorBar(title='Wave Position', ypad=75),
        ),
        line = dict(
            color = 'rgba(1, 1, 1, 0.15)',
            width=1,
        ),
        visible='legendonly'
    )
    layout = graphs.Layout(
        title = title,
        xaxis = dict(title='PCA Component 0'),
        yaxis = dict(title='PCA Component 1'),
        annotations=annotations,
    )
    figure = graphs.Figure(data=[trace_2D, trace_2D1, trace_2D2], layout=layout)
    return plotly.iplot(figure)



def make_state_space_plot_2D_by_value(data, activations_reduced, wave_length, seed_weighting, title, annotations=[]):
    trace_2D = graphs.Scatter(
        name='Presentation',
        x = activations_reduced['network_cell'][:,0][:wave_length],
        y = activations_reduced['network_cell'][:,1][:wave_length],
        text = list(map(lambda t: 'Timestep: ' + str(t), data[0][:wave_length])),
        mode = 'lines+markers',
        marker = dict(
            size='8',
            color = numpy.concatenate((numpy.array(outputs[0][0][0].data.numpy()), numpy.array(list(map(lambda x: x[0][0].data.numpy(), outputs))).squeeze()), axis=0)[:wave_length],
            colorscale='Viridis',
            showscale=True,
            colorbar=graphs.ColorBar(title='Prediction', ypad=75),
        ),
        line = dict(
            color = 'rgba(1, 1, 1, 0.15)',
            width=1,
        ),
        visible='legendonly'
    )
    trace_2D1 = graphs.Scatter(
        name='Seeding',
        x = activations_reduced['network_cell'][:,0][wave_length:int(seed_weighting*len(data[0]))],
        y = activations_reduced['network_cell'][:,1][wave_length:int(seed_weighting*len(data[0]))],
        text = list(map(lambda t: 'Timestep: ' + str(t), data[0][wave_length:int(seed_weighting*len(data[0]))])),
        mode = 'lines+markers',
        marker = dict(
            size='8',
            color = numpy.concatenate((numpy.array(outputs[0][0][0].data.numpy()), numpy.array(list(map(lambda x: x[0][0].data.numpy(), outputs))).squeeze()), axis=0)[wave_length:int(seed_weighting*len(data[0]))],
            colorscale='Viridis',
            showscale=True,
            colorbar=graphs.ColorBar(title='Prediction', ypad=75),
        ),
        line = dict(
            color = 'rgba(1, 1, 1, 0.15)',
            width=1,
        ),
    )
    trace_2D2 = graphs.Scatter(
        name = 'Feedback',
        x = activations_reduced['network_cell'][:,0][int(seed_weighting*len(data[0])):],
        y = activations_reduced['network_cell'][:,1][int(seed_weighting*len(data[0])):],
        text = list(map(lambda t: 'Timestep: ' + str(t), data[0][int(seed_weighting*len(data[0])):])),
        mode = 'lines+markers',
        marker = dict(
            size='8',
            color = numpy.concatenate((numpy.array(outputs[0][0][0].data.numpy()), numpy.array(list(map(lambda x: x[0][0].data.numpy(), outputs))).squeeze()), axis=0)[int(seed_weighting*len(data[0])):],
            colorscale='Viridis',
            showscale=True,
            colorbar=graphs.ColorBar(title='Prediction', ypad=75),
        ),
        line = dict(
            color = 'rgba(1, 1, 1, 0.15)',
            width=1,
        ),
        visible='legendonly'
    )
    layout = graphs.Layout(
        title = title,
        xaxis = dict(title='PCA Component 0'),
        yaxis = dict(title='PCA Component 1'),
        annotations=annotations,
    )
    figure = graphs.Figure(data=[trace_2D, trace_2D1, trace_2D2], layout=layout)
    return plotly.iplot(figure)



def make_state_space_plot_3D_by_position(data, activations_reduced, wave_length, title, annotations=[]):
    trace_2D = graphs.Scatter3d(
        x = activations_reduced['network_cell'][:,0],
        y = activations_reduced['network_cell'][:,1],
        z = data[0],
        mode = 'lines+markers',
        marker = dict(
            size='4',
            color = numpy.tile(numpy.arange(wave_length), 80),
            colorscale='Viridis',
            showscale=True,
            colorbar=graphs.ColorBar(title='Wave Position')
        ),
        line = dict(
            color = 'rgba(1, 1, 1, 0.5)',
        ),
    )
    layout = graphs.Layout(
        title = title,
        height = 1000,
        scene = graphs.Scene(
            xaxis = dict(title='PCA 0'),
            yaxis = dict(title='PCA 1'),
            zaxis = dict(title='Step'),
            annotations = annotations,
        )
    )
    figure = graphs.Figure(data=[trace_2D], layout=layout)
    return plotly.iplot(figure)


def make_water_data(days=2, train=True):
    data = pandas.read_csv('water_data.csv')
    meters = ['Brecon 1']
    meter = random.choice(meters)
    data_meter = data[data['meter']==meter]
    data_meter = data_meter.set_index('date')
    data_meter.index = pandas.to_datetime(data_meter.index)
    data_meter = data_meter.resample('1H').sum()
    data_meter = data_meter.rolling('4h').mean()
    water_data = data_meter['value']['2017-10-28':'2017-11-04']
    water_data = ((water_data - water_data.mean()) / water_data.std()) / 2
    return water_data.index, numpy.array(water_data)


def make_state_space_plot_3D_by_value(data, activations_reduced, wave_length, title, annotations=[]):
    trace_2D = graphs.Scatter3d(
        x = activations_reduced['network_cell'][:,0],
        y = activations_reduced['network_cell'][:,1],
        z = data[0],
        mode = 'lines+markers',
        marker = dict(
            size='4',
            color = numpy.concatenate((numpy.array(outputs[0][0][0].data.numpy()), numpy.array(list(map(lambda x: x[0][0].data.numpy(), outputs))).squeeze()), axis=0),
            colorscale='Viridis',
            showscale=True,
            colorbar=graphs.ColorBar(title='Prediction')
        ),
        line = dict(
            color = 'rgba(1, 1, 1, 0.5)',
        ),
    )
    layout = graphs.Layout(
        title = title,
        height = 1000,
        scene = graphs.Scene(
            xaxis = dict(title='PCA 0'),
            yaxis = dict(title='PCA 1'),
            zaxis = dict(title='Step'),
            annotations = annotations,
        )
    )
    figure = graphs.Figure(data=[trace_2D], layout=layout)
    return plotly.iplot(figure, filename='sine_wave_state_trace')

# Allow for interactive plots.
plotly.init_notebook_mode(connected=True)

#Ensure reproducability.
set_random_seeds(0)

# Plot a square wave example.
data = normalise_wave(make_square(steps=16, tiles=7, offset=0))
traces = make_traces_2d([
    dict(
        name='Input Signal',
        x=data[0],
        y=data[1],
        mode='lines'),
])
annotations=[
    dict(x=15, y=1.3, showarrow=False, text='Double click to reset graph.'),
    dict(x=35, y=-1.3, showarrow=False, text='Click and drag anywhere to zoom into a selection.'),
]
plotly.iplot(make_figure_2d(traces, 'Square Wave Example', 'Timestep', 'Value', annotations))

Before we describe how we formatted the task it first helps to understand what an LSTM network does. A LSTM network is similar in structure to a standard neural network. The basic unit of this network is a node which takes in various inputs and produces a predictive output based on that input. Many of these nodes are given the input signal in parallel to form a layer. The outputs of a layer can then be presented to an additional layer to form a more powerful model. This process can be repeated until the last layer which produces the model’s final prediction. Each LSTM node retains some fundamental information about both the current and historical inputs in the form of two internal values, namely the cell and hidden state activations. When an LSTM network acts upon past information in a prediction, these states are responsible for storing that information. Herein lies the big question of LSTM explainability: how can we determine what information the model is storing when it makes a prediction?

In order to adequately demonstrate approaches which begin to tackle this question, we setup the square wave prediction scenario to contain 3 distinct phases. Once trained to recognise and reproduce the signal, we present the model with the wave, point by point, with its state initialised to zeroes. Due to this, the first phase involves presenting one period of the wave to the network in order to allow it to generate a meaningful state. This phase is called the presentation phase, as we are presenting the network with the signal for the first time. During the next phase we continually provide the network with the appropriate input for 2 periods to demonstrate how its state would evolve in a real-time prediction scenario. We call the second phase seeding as we seed the network with the values of the actual square wave, tasking it with only predicting the value of the sample for the next timestep. The final phase exists solely to assess whether or not the network is able to continue to reproduce the input signal without external correction. To achieve this we simply feed the prediction made by the network back into itself as input for the next timestep, rather than giving it the actual value for the wave. Correspondingly, the final phase is called feedback.

Small Network with Square Wave¶

Since, this problem is particularly trivial we first start with a small pre-trained network which consists of 2 layers with 4 LSTM nodes each, and a final linear layer which combines the outputs of the uppermost LSTM layer into a single prediction. The figure below shows the prediction results across the 3 phases outlined above:

# Simlpe square wave prediction.

# Generate data and load model.
data = normalise_wave(make_square(steps=16, tiles=7, offset=0))
model = WaveLSTM(1, 2, 4)
model.load_state_dict(pickle.load(open('simple_square.p', 'rb')))

# Set parameters relating to data.
wave_length = 16
seed_weighting = 0.6
seed_amount = int(len(data[1]) * seed_weighting)

# Make predictions and set predictions.
outputs, states = make_predictions(data, seed_amount, wave_length)

# Plot results with annotations.
annotations=[
    dict(x=wave_length-2, y=1.05, ax=-10, ay=-40, text='Develop model state.'),
    dict(x=wave_length*3-10, y=-1.05, ax=20, ay=40, text='Continue presenting source signal.'),
    dict(x=wave_length*5+3, y=-1.05, ax=20, ay=40, text="Feedback the model's previous predictions."),
    dict(x=wave_length*7, y=1.1, ax=-120, ay=-30, text="Click legend to show/hide elements."),
]
make_prediction_plot(data, outputs, wave_length, seed_amount, 'Square Wave Prediction Task (Simple)', annotations, show_prediction=False)

From selecting the prediction trace in the legend, we can see that the model has definitely captured and can reproduce the signal; but again, how can we tell what information the network is storing during each timestep? One common approach towards starting to answer this question is to plot the individual cell activation state values for each timestep [3], which we do as follows:

# Simple square wave state activations.

_, _, node_activations = get_node_activations(states, seed_weighting)
annotations=[
    dict(x=wave_length*2+5, y=4.2, showarrow=False, text="Observe how node 6 predicts the signal's repitition."),
    dict(x=wave_length-1, y=-1, ax=-40, ay=90, text='Here,'),
    dict(x=wave_length*2-1, y=-1, ax=-40, ay=90, text='and here,'),
    dict(x=wave_length*3-1, y=-1, ax=-40, ay=90, text='and here...'),
    dict(x=wave_length*7, y=4, ax=-120, ay=-30, text="Double-click node to show/hide the rest."),
]
make_state_activation_plot(data, node_activations, 'State Activations for Square Wave Network (Simple)', annotations, color=6)

If you look closely for patterns, one can see that during every occurrence of the wave switching from high to low, the waves tend to exhibit the same behaviour. These happen every 8 timesteps as the period of our wave is 16 and there is one switch after 8 samples. From this we gain some understanding that our model is in fact capturing useful information in its states. But of course, we currently know the dynamics of our wave and what to look for. If we were unsure of patterns in our network, these dynamics would be hard to elicit through this simplistic visualisation. In fact, before we explore more complex examples, let us first examine how the difficulty of noticing these patterns changes with a more realistically sized network.

Large Network with Square Wave¶

Since the sizes of LSTM networks tend to be much larger than 4 nodes per layer, we now consider how the same cell activation visualisation aids in the understanding of a larger network with 32 nodes per layer. Below are the same two plots, shown for the smaller square wave network, but for the results of the new larger network:

# Simple square wave (large) prediction.

# Generate data and load model.
data = normalise_wave(make_square(steps=16, tiles=7, offset=0))
model = WaveLSTM(1, 2, 32)
model.load_state_dict(pickle.load(open('simple_square_32.p', 'rb')))

# Make predictions and set predictions.
outputs, states = make_predictions(data, seed_amount, wave_length)

# Plot results with annotations.
annotations=[
    dict(x=6, y=-.8, ax=100, ay=40, text='The model has no context yet.'),
    dict(x=wave_length*3-3, y=1.1, ax=-50, ay=-40, text='The larger model better fits the wave pattern.'),
]
make_prediction_plot(data, outputs, wave_length, seed_amount, 'Square Wave Prediction Task (Large)', annotations)

# Simple square wave state activations.

_, activations_reduced, node_activations = get_node_activations(states, seed_weighting)
annotations=[
    dict(x=wave_length*2, y=4.1, showarrow=False, text='Observe how the nodes cross over at the switch points.'),
    dict(x=wave_length*2, y=-.3, ax=-30, ay=135, text='Here,'),
    dict(x=wave_length*3-8, y=-.3, ax=0, ay=135, text='and here,'),
    dict(x=wave_length*3, y=-.3, ax=30, ay=135, text='and here...'),
]
make_state_activation_plot(data, node_activations, 'State Activations for Square Wave Network (Large)', annotations)

From the visualisation of the 64 cell activation states (includes both layers) of the larger square wave network, it is already becoming much harder to notice the dynamic patterns that are indicative of an upcoming switch in the signal. Not only that, but the amount of nodes in modern LSTM networks can be in the range of thousands. Since it is harder to notice patterns in the states of larger networks, this visualisation does not suffice even if we know the dynamics present in the input signal. In order to alleviate this, perhaps we could create a more manageable representation of the states that scales well with the amount of nodes present.

State Space Representations¶

When dealing with visualising the cell activation states in these networks, we basically have a problem of dimensionality. The more nodes we have in the network, the more states we need to examine; and the more states we need to examine, the harder it is notice patterns simply by looking at how the states themselves change over time. A simple solution to this problem is to use a dimensionality reduction technique to minimise the amount of visual clutter such that patterns captured by the networks’ states are easier to identify. That then raises the question, what are the appropriate number of dimensions to maximally reduce visual clutter whilst still eliciting useful visual patterns?

To start, let us consider what the usefulness of 2 dimensions would be. If only 2 dimensions were still able to capture interpretable information about the internal states of the network, we could then create a two dimensional scatter plot. A particular location in this scatter plot would represent a specific network state. If these reduced dimensions do in fact contain useful state information, we would expect closer points in space to represent when the network is remembering similar information. Conversely, if two points are further apart the network should be storing vastly different information. How do we verify this whilst still having no method of actually understanding the states at a fundamental level?

Well, if these premises hold true, then this ‘state space representation’ would allow us to know if the network is progressing towards a state that we’ve encountered before. We still do not know what the states contain exactly, but we would then be able to compare our current state with historical ones. If we intuitively know something about our input signal, such as our square wave, then we already know which states should be in similar locations. Since the wave is repeating and doesn’t change across its periods, the same offset position in every period should ideally be storing the same information in anticipation of which part of the wave is coming next. For example, when the square wave it about to switch from low to high, the network needs to keep track of how long it has been since the drop from high to low, and its states should be reflective of that. So for every switch from low to high (or high to low, or any position for that matter) the network should be capturing that same information, which implies that their positions in our state space should be closer together.

Square Wave State Space¶

Now that we have an idea of what state space representations are, how they could potentially be useful, and how to validate that they are actually representing something useful, let us now apply the concept to our larger square wave model. First we reduce the dimensionality of the cell activations for all of our nodes across the entirety of the network using Principal Component Analysis (PCA). We only train the PCA model on the states during the presentation and seeding phases of our training paradigm, as the feedback phase is representative of future predictions. The following visualisation is the result:

# Simple square wave (Large) PCA state activation plot.

annotations = [
    dict(x=wave_length*2-2, y=10, showarrow=False, text='Observe how we can now differenciate two switch types.'),
    dict(x=wave_length*2+.1, y=2.9, ax=-30, ay=120, text='High to low.'),
    dict(x=wave_length*3-7.2, y=-5, ax=35, ay=40, text='Low to high.'),
]
make_reduced_state_activation_plot_2D_by_position(data, activations_reduced, wave_length, 'PCA State Activations for Square Wave Network (Large)', annotations)

Since we know the dynamics of the repeating square wave, we can label each point with its relative position within its period. This allows us to see whether or not the same position along the wave has similar values for each of the components. Additionally, we can plot the same two components in our state space scatter plot:

# Simple square wave (Large) 2D state space plot.

annotations = [
    dict(x=3.5, y=3.4, ax=-20, ay=50, text='High to low switch points.'),
    dict(x=-5.1, y=-4, ax=30, ay=-50, text='Low to high switch points.'),
]
make_state_space_plot_2D_by_position(data, activations_reduced, wave_length, seed_weighting, 'PCA of Cell Activations for Square Wave Network (Large) by Wave Position', annotations)

From this state space plot, we can see that our assumptions held true. After the initial presentation phase, the network settled on a consistent travel pattern where similar positions along the wave had similar positions in the plot. Additionally, this holds true even during the feedback phase with minor amounts of deviation. This suggests that the model contains an internal representation of the dynamics of the wave, and only uses the input for initially syncing with the phase of the wave and correcting minute deviations. If we transform our plot into 3 dimensions, where the third axis is the time step, we can see this clearer:

# Simple square wave (Large) state space 3D plot.

annotations = [
    dict(x=3.38, y=3.78, z=96, ax=0, ay=-100, text='High to low switch points.'),
    dict(x=-5.05, y=-4.4, z=105, ax=0, ay=-100, text='Low to high switch points.'),
]
make_state_space_plot_3D_by_position(data, activations_reduced, wave_length, 'PCA of Cell Activations for Square Wave Network (Large) by Wave Position', annotations)

Even though this representation is useful for a simple pattern such as a square wave, is it capable of differentiating between multiple patterns in the same input signal?

State Space of Alternating Waves¶

From the two previous visualisations one can see that the network travels a consistent circular path during each period. Given our assumptions, if you consider the states of a model trained on two alternating waves each with their own dynamics, then our state space plot should contain two distinct traveling paths, one for each wave. To test this we train a network of the same architecture on a more complex repeating signal. The signal starts with 2 cycles of the square wave as before, then transitions to 2 cycles of a triangle wave with only 8 samples per wave instead of the 16 of the square wave. Using the same prediction setup as before we obtain the following results:

# Alternating wave prediction.

# Generate data and load model.
data = attach_waves(make_square(steps=16, tiles=2, offset=0), make_triangle(steps=8, tiles=2, offset=0), tiles=7, offset=0)
model = WaveLSTM(1, 2, 32)
model.load_state_dict(pickle.load(open('square_triangle.p', 'rb')))

# Update parameters to match new wave.
wave_length = 48
seed_amount = int(len(data[1]) * seed_weighting)

# Make predictions and set predictions.
outputs, states = make_predictions(data, seed_amount, wave_length)

# Plot results.
make_prediction_plot(data, outputs, wave_length, seed_amount, 'Alternating Wave Task')

From this we can see that the model was again able to learn how to predict the input signal extremely well. Let us now see how our reduced representation changes when presented with more complex dynamics:

# Alternating Wave PCA state activation plot.

_, activations_reduced, node_activations = get_node_activations(states, seed_weighting)

# Plot results with annotations.
annotations = [
    dict(x=wave_length+8, y=12, ax=-20, ay=-50, text='Square wave patterns.'),
    dict(x=wave_length*2-7, y=1, ax=60, ay=-130, text='Triangle wave patterns.'),
]
make_reduced_state_activation_plot_2D_by_position(data, activations_reduced, wave_length, 'PCA State Activations for Alternating Network by Wave Position', annotations)

Again we can see a clear distinction between the different positions within the square wave, however we can also see differences between the square wave cycles and the triangle wave cycles. Below we can also see that those differences are also reflected in the state space plot:

# Alternating wave network 2D state space plot.

annotations = [
    dict(x=-4, y=7.5, ax=250, ay=20, text='First low to high switch points.'),
    dict(x=-3.9, y=6.5, ax=250, ay=50, text='Second low to high switch points.'),
    dict(x=9.95, y=-5.4, ax=-20, ay=-50, text='First high to low switch points.'),
    dict(x=8.95, y=-5.7, ax=-180, ay=-130, text='Second high to low switch points.'),
    dict(x=-3.6, y=-2, ax=150, ay=-10, text='First triangle wave peaks.'),
    dict(x=-3.9, y=-1, ax=40, ay=-50, text='Second triangle wave peaks.'),
]
make_state_space_plot_2D_by_position(data, activations_reduced, wave_length, seed_weighting, 'PCA of Cell State Activations for Alternating Network by Wave Postion', annotations)

Transforming the plot into 3 dimensions again makes the differences more evident:

# Alternating wave 3D state space plot.

annotations = [
    dict(x=-3.93, y=6.5, z=311, ax=0, ay=-100, text='Low to high switch points.'),
    dict(x=9.02, y=-5.94, z=319, ax=0, ay=-100, text='High to low switch points.'),
    dict(x=-3.93, y=-1.24, z=332, ax=0, ay=-100, text='Triangle wave peaks.'),
]
make_state_space_plot_3D_by_position(data, activations_reduced, wave_length, 'PCA of Cell Activations for Alternating Network by Wave Position', annotations)

From our state space representation of the alternating wave scenario, we have provided evidence that this method for visualising and comparing the internal states of LSTM networks extends beyond a single wave pattern. That being said, this scenario still remains rather contrived and unrealistic. Additionally, since we control the period of the waves and when they switch, we can aid in the visualisation process by coloring each state space point with its position in the wave. In more realistic data sources, we do not have that information and thus need a different method for verifying the similarity between different nearby states.

Real-World Applications¶

To identify the utility of this approach in a realistic context, we will use a dataset of water usage in the student accommodation buildings of the University of the West of England. The readings were recorded from water meters in four separate buildings. Each building contains approximately 5-7 water meters. Readings were taken every half an hour over a two semester period. We first separated the dataset into a training and test portion by separating out all of the meters from one of the buildings into the test set. The model is then tasked with providing hourly predictions for a single meter signal. During training, a small 2-day window, randomly selected from the available water meters, is presented to the model as a single example.

The idea is that the model should be able to learn the normal patterns of student water usage, learn to ignore the day to day noise, and the differences in meter locations. For example, if you calculate the average value by hour for each building, you obtain a pattern similar to this:

# Average pattern for one meter.

# Load the water data with an average example.
data = pandas.read_csv('../../data/water_data/water_data.csv')
meter = data[data['meter'] == 'Brecon 1']
meter = meter.set_index('date')
meter.index = pandas.to_datetime(meter.index)
meter = meter.resample('1h').sum()
meter = meter.groupby(meter.index.hour)['value'].mean()

# Plot average data with annotations.
traces = make_traces_2d([dict(
    x=numpy.array(meter.index),
    y=numpy.array(meter),
    mode='lines',
)])
annotations = [
    dict(x=5, y=0.15, ax=-20, ay=-50, text='Early morning dip.'),
    dict(x=12, y=0.5, ax=-20, ay=50, text='Lunchtime peak.'),
    dict(x=20, y=0.45, ax=-20, ay=50, text='Evening peak.'),
]
figure = make_figure_2d(traces, 'Average Water Usage for Single Meter', 'Hour', 'Value', annotations)
plotly.iplot(figure)

As can be expected there is a significant drop in water usage in the early hours of the morning, as well as two spikes around mid-afternoon and in the evening. Even though any particular day on a specific meter can deviate drastically from this pattern, there is still an underlying distribution of usage which is common across all of the meter locations. This is some of the information we should expect the model to be capturing. Perhaps there are other dynamics which become more evident in the state space representation, but first let us start with examining how well our network performs over a 7-day period:

# Water data predictions.

# Generate water data example and load model.
data = make_water_data(7)
model = WaveLSTM(1, 2, 32)
model.load_state_dict(pickle.load(open('water_data3.p', 'rb')))

# Update parameters to match water data.
seed_weighting = 0.85
wave_length = 24
seed_amount = int(len(data[1]) * seed_weighting)

# Make predictions and set predictions.
outputs, states = make_predictions(data, seed_amount, wave_length)

# Plot results with annotations.
annotations = [
    dict(x=datetime.datetime(2017, 10, 30, 5, 0), y=-0.9, ax=-30, ay=30, text='Early morning dip.'),
    dict(x=datetime.datetime(2017, 10, 30, 11, 30), y=0.9, ax=0, ay=-40, text='Lunchtime peak.'),
    dict(x=datetime.datetime(2017, 10, 30, 23, 0), y=0.6, ax=70, ay=-60, text='Evening peak.'),
]
make_prediction_plot(data, outputs, wave_length, seed_amount, 'Water Data Task', annotations)

Next, the initial method for visualising the model’s states:

# Plot of water data states.

_, activations_reduced, node_activations = get_node_activations(states, seed_weighting)
annotations = [
    dict(x=datetime.datetime(2017, 10, 30, 5, 0), y=-1.2, ax=-30, ay=40, text='Early morning dip.'),
    dict(x=datetime.datetime(2017, 10, 30, 11, 30), y=0.7, ax=0, ay=-40, text='Lunchtime peak.'),
    dict(x=datetime.datetime(2017, 10, 30, 23, 0), y=0.5, ax=70, ay=-60, text='Evening peak.'),
]
make_state_activation_plot(data, node_activations, 'State Activations for Water Data Network', annotations)

From the basic plot of the cell activation states of the network’s nodes it is hard to discern when the network is in similar states. Additionally, the large majority of the states seem to have minimal relationship to anything meaningful in the input signal. If we consider creating our state space representation we now encounter a problem. We have no guarantee that the average 24 hour pattern holds true for each individual weekday. The network's state for 20:00 on a Monday could likely be justifiably different from the network’s state at the same time on a Friday. This means we should no longer color our points with their position in the cycle (days in this case). Instead we can use the actual values that the model predicted.

If we utilise the actual values predicted by network we should similarly expect states which are closer in the state space to have similar values, however not as strongly as before. The reason for this is that the position in state space is representative of the information the network is storing, but the network uses both the information that it stores and its current input to determine its next prediction. So if the network is in a similar state and obtains vastly different inputs, then the predicted value can differ as well. That being said, there should still be a strong correlation between the state space position and the predicted values. The state space representation of the water data network is shown below:

# Water data PCA state activation plot.

_, activations_reduced, node_activations = get_node_activations(states, seed_weighting)

# Plot results with annotations.
annotations = [
    dict(x=datetime.datetime(2017, 10, 30, 5, 0), y=-2, ax=-30, ay=30, text='Early morning dip.'),
    dict(x=datetime.datetime(2017, 10, 30, 11, 30), y=1.9, ax=-10, ay=-30, text='Lunchtime peak.'),
    dict(x=datetime.datetime(2017, 10, 30, 23, 0), y=1.2, ax=10, ay=-30, text='Evening peak.'),
]
make_reduced_state_activation_plot_2D_by_value(data, outputs, activations_reduced, wave_length, 'PCA State Activations for Water Data Network by Prediction Value', annotations)

# Water data 2D state space plot.

annotations = [
    dict(x=-1.58, y=0.12, ax=-30, ay=90, text='Early morning dip.'),
    dict(x=1.44, y=.63, ax=10, ay=-60, text='Lunchtime peak.'),
    dict(x=.878, y=.05, ax=5, ay=-130, text='Evening peak.'),
]
make_state_space_plot_2D_by_value(data, activations_reduced, wave_length, seed_weighting, 'PCA of Cell State Activations for Water Data Network by Prediction Value', annotations)

Again transforming the representation into a 3-dimensional plot gives us a better method for seeing how the state trajectory varies throughout the week:

# 3D state space water data.

annotations = [
    dict(x=-1.57, y=.167, z=datetime.datetime(2017, 10, 30, 5, 0), ax=0, ay=-100, text='Early morning dip.'),
    dict(x=1.43, y=.58, z=datetime.datetime(2017, 10, 30, 11, 30), ax=0, ay=100, text='Lunchtime peak.'),
    dict(x=.878, y=.001, z=datetime.datetime(2017, 10, 30, 23, 0), ax=0, ay=-100, text='Evening peak.'),
]
make_state_space_plot_3D_by_value(data, activations_reduced, wave_length, 'PCA of Cell Activations for Water Data by Prediction Value', annotations)

From the above state space plots, we can conclude that our assumptions hold true even for more complex real-world data. Although the dynamics of this dataset are quite complex, we still see a clear correlation between the values predicted by the network and its position in the state space graphs. We can even notice in advance that the network prediction for November 1st at 15:00 would be an outlier, as the state space position of the network starts deviating from the normal trajectory two hours before. Another thing to note is that our network was unable to reproduce an adequate signal during the feedback phase. Instead it just settles in the center of mass of the state space. This could suggest that due to the randomness inherent in the data, the network relies more heavily on its current input rather than its own captured information.

Using dimensionality reduction to try to interpret both neural networks and their relation to their inputs is not an entirely new idea. Rauber et al. used t-SNE to visualise the hidden activations of Convolutional Neural Networks trained on the MNIST dataset [7]. Olah expanded this idea by demonstrating that multiple models using the same network architecture, and trained on the same data set, can produce different representations [8]. Neither of these works attempt to examine how these representations evolve over a sequence-based task.

There have been attempts to provide explanations for sequence-based neural network models, whether at the node-specific level or globally. Strobel et al. produced a visualisation framework for textual sequences that aids in analysing the relationship between the LSTM cell activations and the input sequence [3]. Their method of dealing with the problem of isolating useful cell activations in a large network is to use a thresholded selection which highlights the associated cells. Other works tend to focus on the interpretation of the responsibilities for individual nodes [4, 5].

The main novelty provided by our approach when compared to existing works is that we use dimensionality reduction to interpret the states of the entire network. We avoid focusing on the responsibilities of particular nodes, and instead present an approach which allows for a visual comparison between specific timesteps. This then allows for the identification of high-level patterns captured by the network’s states.

Conclusion¶

During this notebook we explored how dimensionality reduction could be used to create state space representations of LSTM networks. These representations have the advantage that they scale with the amount of nodes in the network and allow for the elicitation of global patterns captured by the network’s internal states. To demonstrate this we first explored a simple square wave prediction task to introduce and provide justification for the approach. We then moved on to a more complex alternating waves scenario which showed that the technique was capable of representing different state patterns. Lastly, we applied the method on a real-world water usage dataset to demonstrate how these state space representations could be used in realistic scenarios. We believe there are a multitude of avenues for developing state space representations further. In future works, we aim to explore using these representations to analyse characteristics of a trained model. For example, by analysing the travel patterns of the network through the state space, we should be able to ascertain the degree to which a model is relying upon the current input signal against its own internal states.

References¶

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.

[2] Z. C. Lipton, J. Berkowitz, and C. Elkan, “A Critical Review of Recurrent Neural Networks for Sequence Learning,” May 2015.

[3] H. Strobelt, S. Gehrmann, B. Huber, H. Pfister, and A. M. Rush, “Visual Analysis of Hidden State Dynamics in Recurrent Neural Networks,” Jun. 2016.

[4] T. Galstyan and H. Khachatrian, “Interpreting Neurons in an LSTM Network,” YerevaNN, 2017. [Online]. Available: http://yerevann.github.io/2017/06/27/interpreting-neurons-in-an-LSTM-network/. [Accessed: 11-Jul-2018].

[5] E. Chen, “Exploring LSTMs,” Edwin Chen, 2017. [Online]. Available: http://blog.echen.me/2017/05/30/exploring-lstms/. [Accessed: 11-Jul-2018].

[6] M. T. Ribeiro, S. Singh, and C. Guestrin, “Model-Agnostic Interpretability of Machine Learning,” Jun. 2016.

[7] P. E. Rauber, S. G. Fadel, A. X. Falcao, and A. C. Telea, “Visualizing the Hidden Activity of Artificial Neural Networks,” IEEE Trans. Vis. Comput. Graph., vol. 23, no. 1, pp. 101–110, Jan. 2017.

[8] “Visualizing Representations: Deep Learning and Human Beings,” colah’s blog, 2015. [Online]. Available: http://colah.github.io/posts/2015-01-Visualizing-Representations/. [Accessed: 11-Jul-2018].