6. Encoder-Decoder Architecture with the Subclassing API

1. Writing a custom Layer

Before we write custom layers in TensorFlow, let's look at the definition of the Layer class.

From the tf.keras.layers.Layer documentation:

1.1 Example
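A minimal sketch of the idea (the ScaleLayer name and its single weight are illustrative, not from the notebook): a custom layer subclasses tf.keras.layers.Layer, creates its weights in build(), and defines the forward computation in call().

import tensorflow as tf

class ScaleLayer(tf.keras.layers.Layer):
    """Illustrative layer that multiplies its input by a learned per-feature scale."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)   # registers the layer with Keras (name, dtype, ...)

    def build(self, input_shape):
        # build() runs once, when the input shape is known,
        # so the weight shape can depend on the input
        self.scale = self.add_weight(
            name="scale", shape=(input_shape[-1],), initializer="ones", trainable=True)

    def call(self, inputs):
        # the forward computation
        return inputs * self.scale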

Read more about the super() function here:

1.2 Resources

Do read this blog for more information.

2. Writing a custom Model

There are three ways to implement a model architecture in TensorFlow. The third and final method, using Keras and TensorFlow 2.0, is called model subclassing.

Inside tf.keras, the Model class is the root class used to define a model architecture. Since tf.keras uses object-oriented programming, we can subclass the Model class and insert our own architecture definition.

    The `Model` class has the same API as `Layer`, with the following differences:
        It exposes built-in training, evaluation, and prediction loops (model.fit(), model.evaluate(), model.predict()).
        It exposes the list of its inner layers, via the `model.layers` property.
        It exposes saving and serialization APIs.

Effectively, the “Layer” class corresponds to what we refer to in the literature as a “layer” (as in “convolution layer” or “recurrent layer”) or as a “block” (as in “ResNet block” or “Inception block”).

Meanwhile, the “Model” class corresponds to what is referred to in the literature as a “model” (as in “deep learning model”) or as a “network” (as in “deep neural network”).

2.1 Example

import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Softmax


class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, num_outputs, **kwargs):
        super().__init__(**kwargs)
        self.num_outputs = num_outputs

    def build(self, input_shape):
        # the kernel shape depends on the last dimension of the input
        self.kernel = self.add_weight(name="kernel",
                                      shape=[int(input_shape[-1]), self.num_outputs])

    def call(self, input):
        print(input.shape, self.kernel.shape)
        return tf.matmul(input, self.kernel)


class MyModel(Model):
    def __init__(self, num_inputs, num_outputs, rnn_units):
        super().__init__()
        self.dense = MyDenseLayer(num_outputs, name='myDenseLayer')

        # Note: if you want an LSTM here, one option is to wrap an LSTMCell in an RNN layer:
        # self.lstmcell = tf.keras.layers.LSTMCell(rnn_units)
        # self.rnn = tf.keras.layers.RNN(self.lstmcell)

        self.softmax = Softmax()

    def call(self, input):

        # output = self.rnn(input)

        output = self.dense(input)
        output = self.softmax(output)
        return output

import numpy as np

# dummy data: 10 samples with 5 features each, and 2 output classes
data = np.zeros([10, 5])
y = np.zeros([10, 2])

model = MyModel(num_inputs=5, num_outputs=2, rnn_units=32)

loss_object = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam()

model.compile(optimizer=optimizer, loss=loss_object)
model.fit(data, y, steps_per_epoch=1)

model.summary()

Output:

3. Building Encoder-Decoder Architecture with the custom layers

Encoder Architecture
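A sketch of an Encoder layer consistent with the verbose output and parameter counts shown further below; the embedding dimension (50), LSTM units (64), and vocabulary size (500) are inferred from that output, not taken from the notebook's exact code.

class Encoder(tf.keras.layers.Layer):
    """Embeds the input token ids and runs them through an LSTM, returning
    the full output sequence plus the final hidden and cell states."""
    def __init__(self, vocab_size=500, embedding_dim=50, lstm_units=64, **kwargs):
        super().__init__(**kwargs)
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(lstm_units, return_sequences=True, return_state=True)

    def call(self, input_sequence):
        embedded = self.embedding(input_sequence)        # (batch, enc_len, embedding_dim)
        output, state_h, state_c = self.lstm(embedded)   # (batch, enc_len, units), (batch, units), (batch, units)
        return output, state_h, state_c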

Decoder Architecture
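A matching Decoder sketch, again with inferred dimensions; its LSTM is initialized with the encoder's final hidden and cell states.

class Decoder(tf.keras.layers.Layer):
    """Embeds the target token ids and runs an LSTM initialized with the encoder states."""
    def __init__(self, vocab_size=500, embedding_dim=50, lstm_units=64, **kwargs):
        super().__init__(**kwargs)
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(lstm_units, return_sequences=True, return_state=True)

    def call(self, target_sequence, initial_states):
        embedded = self.embedding(target_sequence)       # (batch, dec_len, embedding_dim)
        output, state_h, state_c = self.lstm(embedded, initial_state=initial_states)
        return output, state_h, state_c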

Custom Model Architecture
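A sketch of the custom model tying the two together. The Dense projection uses a softmax activation because the model is compiled below with loss='sparse_categorical_crossentropy', and the *_inputs_length arguments are kept only to match the constructor call in the training cell; both choices are assumptions, not the notebook's exact code.

class MyModel(tf.keras.Model):
    def __init__(self, encoder_inputs_length, decoder_inputs_length, output_vocab_size):
        super().__init__()
        # the *_inputs_length arguments are not used in this sketch
        self.encoder = Encoder(vocab_size=output_vocab_size, name='encoder')
        self.decoder = Decoder(vocab_size=output_vocab_size, name='decoder')
        # projects decoder outputs onto the vocabulary
        self.dense = tf.keras.layers.Dense(output_vocab_size, activation='softmax')

    def call(self, data):
        encoder_input, decoder_input = data
        encoder_output, enc_h, enc_c = self.encoder(encoder_input)
        # initialize the decoder with the encoder's final hidden and cell states
        decoder_output, _, _ = self.decoder(decoder_input, [enc_h, enc_c])
        return self.dense(decoder_output)                # (batch, dec_len, vocab_size)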

Model Compiling and Training

model = MyModel(encoder_inputs_length=10, decoder_inputs_length=10, output_vocab_size=500)

ENCODER_SEQ_LEN = 30
DECODER_SEQ_LEN = 20

# random token ids as dummy encoder/decoder inputs; one-hot targets for the categorical variant
input = np.random.randint(0, 499, size=(2000, ENCODER_SEQ_LEN))
output = np.random.randint(0, 499, size=(2000, DECODER_SEQ_LEN))
target = tf.keras.utils.to_categorical(output, 500)

# loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
optimizer = tf.keras.optimizers.Adam()

model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')

model.fit([input, output], output, steps_per_epoch=1)

"""
or you can try this

model.compile(optimizer=optimizer,loss='categorical_crossentropy')
model.fit([input, output], target, steps_per_epoch=1)

"""
model.summary()

Model output / verbose log

-------------------- ENCODER --------------------
ENCODER ==> INPUT SQUENCES SHAPE : (?, 30)
ENCODER ==> AFTER EMBEDDING THE INPUT SHAPE : (?, 30, 50)
---------------------------
ENCODER ==> OUTPUT SHAPE (?, 30, 64)
ENCODER ==> HIDDEN STATE SHAPE (?, 64)
ENCODER ==> CELL STATE SHAPE (?, 64)
-------------------- DECODER --------------------
DECODER ==> INPUT SQUENCES SHAPE : (?, 20)
WE ARE INITIALIZING DECODER WITH ENCODER STATES : (?, 64) (?, 64)
---------------------------
FINAL OUTPUT SHAPE (?, 20, 500)
---------------------------
1/1 [==============================] - 4s 4s/step - loss: 6.2145
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
encoder (Encoder)            multiple                  54440
_________________________________________________________________
decoder (Decoder)            multiple                  54440
_________________________________________________________________
dense (Dense)                multiple                  32500
=================================================================
Total params: 141,380
Trainable params: 141,380
Non-trainable params: 0
_________________________________________________________________

4. Vanilla Seq2Seq

4.1 Seq2Seq in training

The Seq2Seq framework relies on the encoder-decoder paradigm. The encoder encodes the input sequence, while the decoder produces the target sequence.

Encoder

Our input sequence is "how are you". Each word from the input sequence is associated with a vector \( w \in \mathbb{R}^d \) (via a lookup table). In our case, we have 3 words, so our input is transformed into \( [w_0, w_1, w_2] \in \mathbb{R}^{d \times 3} \). Then, we simply run an LSTM over this sequence of vectors and store the last hidden state output by the LSTM: this is our encoder representation \( e \). Let's write the hidden states as \( [e_0, e_1, e_2] \) (and thus \( e = e_2 \)).
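In equations, with \( w_t \) the embedding of the \( t \)-th input word and \( e_{-1} \) the (e.g. zero) initial state, the encoder computes

\( e_t = \operatorname{LSTM}(e_{t-1}, w_t) \quad \text{for } t = 0, 1, 2, \qquad e = e_2 \)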

Decoder

Now that we have a vector \( e \) that captures the meaning of the input sequence, we use it to generate the target sequence word by word. We feed another LSTM cell with \( e \) as its hidden state and a special start-of-sentence vector \( w_{sos} \) as input. The LSTM computes the next hidden state \( h_0 \in \mathbb{R}^h \). Then, we apply some function \( g : \mathbb{R}^h \mapsto \mathbb{R}^V \) so that \( s_0 := g(h_0) \in \mathbb{R}^V \) is a vector of the same size as the vocabulary.

Then, we apply a softmax to \( s_0 \) to normalize it into a vector of probabilities \( p_0 \in \mathbb{R}^V \). Each entry of \( p_0 \) measures how likely each word in the vocabulary is. Let's say that the word "comment" has the highest probability (and thus \( i_0 = \operatorname{argmax}(p_0) \) corresponds to the index of "comment"). We get the corresponding vector \( w_{i_0} = w_{comment} \) and repeat the procedure: the LSTM takes \( h_0 \) as hidden state and \( w_{comment} \) as input and outputs a probability vector \( p_1 \) over the second word, and so on.
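Putting the decoding step into equations, with \( h_{-1} := e \) and \( w_{i_{-1}} := w_{sos} \):

\( h_t = \operatorname{LSTM}(h_{t-1}, w_{i_{t-1}}), \quad s_t = g(h_t), \quad p_t = \operatorname{softmax}(s_t), \quad i_t = \operatorname{argmax}(p_t) \)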

The decoding stops when the predicted word is a special end-of-sentence token.

Vanilla Decoder

Intuitively, the hidden vector represents the “amount of meaning” that has not been decoded yet.

The above method aims at modelling the distribution of the next word conditioned on the beginning of the sentence and the input, \( \mathbb{P}[\, y_{t+1} \mid y_1, \dots, y_t, x \,] \), by writing it as the corresponding entry of \( p_t = \operatorname{softmax}(g(h_t)) \).

Note: in the simple vanilla seq2seq model we pass the last time step's hidden and cell states to the decoder. Instead of that, we can apply average-pooling or max-pooling over all the hidden states of the encoder and pass the result to the decoder.

The code in Section 3 is an implementation of the above concept; a minimal sketch of the pooling alternative follows.
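This sketch assumes an encoder that returns the full sequence of hidden states, as the Encoder sketch in Section 3 does; using the pooled vector for both the hidden and the cell state is just one simple choice.

# inside MyModel.call, as a variant of the vanilla state passing
encoder_output, enc_h, enc_c = self.encoder(encoder_input)   # encoder_output: (batch, enc_len, units)
avg_state = tf.reduce_mean(encoder_output, axis=1)           # average-pooling over time, (batch, units)
# max_state = tf.reduce_max(encoder_output, axis=1)          # max-pooling alternative
# initialize the decoder with the pooled vector instead of the last hidden state
decoder_output, _, _ = self.decoder(decoder_input, [avg_state, avg_state])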

4.2 Inference
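A minimal greedy-decoding sketch, built on the Section 3 sketches; the greedy_decode helper and the start_token handling are illustrative assumptions, not the notebook's actual inference code.

def greedy_decode(model, input_sequence, start_token=0, max_len=DECODER_SEQ_LEN):
    """Decode one sequence of shape (1, ENCODER_SEQ_LEN) token by token."""
    # run the encoder once and grab its final states
    _, state_h, state_c = model.encoder(input_sequence)
    states = [state_h, state_c]

    current_token = np.array([[start_token]])                # shape (1, 1)
    predictions = []
    for t in range(max_len):
        # one decoder step: feed the previous prediction and the current states
        dec_out, state_h, state_c = model.decoder(current_token, states)
        states = [state_h, state_c]
        probs = model.dense(dec_out)                         # (1, 1, vocab_size)
        current_token = np.argmax(probs.numpy(), axis=-1)    # greedy choice, shape (1, 1)
        print('at time step', t, 'the word is ', current_token)
        predictions.append(int(current_token[0, 0]))
    return predictions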

Output

============================== Inference ==============================
ENCODER ==> INPUT SQUENCES SHAPE : (1, 30)
ENCODER ==> AFTER EMBEDDING THE INPUT SHAPE : (1, 30, 50)
-------------------- started predition --------------------
at time step 0 the word is 0
at time step 0 the word is  [[55]]
at time step 0 the word is  [[55]]
at time step 0 the word is  [[55]]
at time step 0 the word is  [[9]]
at time step 0 the word is  [[50]]
at time step 0 the word is  [[18]]
at time step 0 the word is  [[23]]
at time step 0 the word is  [[56]]
at time step 0 the word is  [[56]]
at time step 0 the word is  [[56]]
at time step 0 the word is  [[56]]
at time step 0 the word is  [[56]]
at time step 0 the word is  [[56]]
at time step 0 the word is  [[56]]
at time step 0 the word is  [[25]]
at time step 0 the word is  [[63]]
at time step 0 the word is  [[25]]
at time step 0 the word is  [[12]]
at time step 0 the word is  [[12]]
at time step 0 the word is  [[3]]