Before we write custom layers in TensorFlow, let's look at the definition of the `Layer` class.
From the tf.keras.layers.Layer documentation:
Check this link for the remaining arguments. There are other methods available as well; please check base_layer.py for a better understanding of the base class.
Read more about the `super()` function here. Do read this blog for more information; a few screenshots from the blog summarize the key points.
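As a quick, illustrative sketch of what `super()` does in the subclassing pattern used throughout this post (the class names below are made up for the example, not taken from the blog):

class Base:
    def __init__(self, name):
        self.name = name

class Child(Base):
    def __init__(self, name, units):
        # super() gives us a proxy to the parent class, so Base.__init__
        # runs first and sets self.name before we add our own attributes
        super().__init__(name)
        self.units = units

c = Child("demo", units=8)
print(c.name, c.units)   # demo 8

This is exactly how a `tf.keras.layers.Layer` subclass hands the common arguments (name, dtype, etc.) back to the base class via `super().__init__(**kwargs)`.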
There are three ways to implement a model architecture in TF. The third and final one, using Keras and TensorFlow 2.0, is called model subclassing.
Inside of tf.keras, the `Model` class is the root class used to define a model architecture. Since tf.keras utilizes object-oriented programming, we can actually subclass the `Model` class and then insert our architecture definition.
The `Model` class has the same API as `Layer`, with the following differences:
- It exposes built-in training, evaluation, and prediction loops (model.fit(), model.evaluate(), model.predict()).
- It exposes the list of its inner layers, via the `model.layers` property.
- It exposes saving and serialization APIs.
Effectively, the “Layer” class corresponds to what we refer to in the literature as a “layer” (as in “convolution layer” or “recurrent layer”) or as a “block” (as in “ResNet block” or “Inception block”).
Meanwhile, the “Model” class corresponds to what is referred to in the literature as a “model” (as in “deep learning model”) or as a “network” (as in “deep neural network”).
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import RNN, Softmax


class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, num_outputs, **kwargs):
        super().__init__(**kwargs)
        self.num_outputs = num_outputs

    def build(self, input_shape):
        # the weight matrix is created lazily, once the input shape is known
        self.kernel = self.add_weight("kernel", shape=[int(input_shape[-1]), self.num_outputs])

    def call(self, input):
        print(input.shape, self.kernel.shape)
        return tf.matmul(input, self.kernel)


class MyModel(Model):
    def __init__(self, num_inputs, num_outputs, rnn_units):
        super().__init__()
        self.dense = MyDenseLayer(num_outputs, name='myDenseLayer')
        # note that we can't use the LSTM or RNN layer directly here;
        # if you want to use an LSTM, you need to write it like this:
        # self.lstmcell = tf.keras.layers.LSTMCell(rnn_units)
        # self.rnn = RNN(self.lstmcell)
        self.softmax = Softmax()

    def call(self, input):
        # output = self.rnn(input)
        output = self.dense(input)
        output = self.softmax(output)
        return output


data = np.zeros([10, 5])
y = np.zeros([10, 2])
model = MyModel(num_inputs=5, num_outputs=2, rnn_units=32)
loss_object = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam()
model.compile(optimizer=optimizer, loss=loss_object)
model.fit(data, y, steps_per_epoch=1)
model.summary()
Output:
Encoder Architecture
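A minimal sketch of an encoder consistent with the shapes and parameter counts in the verbose output further down (vocabulary size 500, embedding dimension 50, 64 LSTM units); the constructor signature and names are assumptions, not necessarily the original implementation.

import tensorflow as tf

class Encoder(tf.keras.layers.Layer):
    # Embedding(500, 50) followed by an LSTM with 64 units that also
    # returns its final hidden and cell states, matching the printed shapes.
    def __init__(self, vocab_size, embedding_dim, enc_units, **kwargs):
        super().__init__(**kwargs)
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(enc_units, return_sequences=True, return_state=True)

    def call(self, input_sequences):
        x = self.embedding(input_sequences)        # (batch, enc_len, embedding_dim)
        output, state_h, state_c = self.lstm(x)    # (batch, enc_len, units), (batch, units) x 2
        return output, state_h, state_c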
Decoder Architecture
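Likewise, a minimal sketch of the decoder, matching the shapes printed below: it embeds the decoder inputs and runs an LSTM whose initial state is taken from the encoder. The argument names are assumptions; it builds on the `Encoder` sketch above.

class Decoder(tf.keras.layers.Layer):
    # Embedding(500, 50) plus an LSTM with 64 units, initialised with the
    # encoder's hidden and cell states.
    def __init__(self, vocab_size, embedding_dim, dec_units, **kwargs):
        super().__init__(**kwargs)
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(dec_units, return_sequences=True, return_state=True)

    def call(self, target_sequences, encoder_states):
        x = self.embedding(target_sequences)       # (batch, dec_len, embedding_dim)
        output, state_h, state_c = self.lstm(x, initial_state=encoder_states)
        return output, state_h, state_c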
Custom Model Architecture
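And a sketch of the subclassed model that ties the two together with a final `Dense` softmax over the vocabulary. The hyper-parameters (embedding dimension 50, 64 units, vocabulary 500) are read off the summary below; the exact constructor and `call` signatures are assumptions.

class MyModel(tf.keras.Model):
    def __init__(self, encoder_inputs_length, decoder_inputs_length, output_vocab_size, **kwargs):
        super().__init__(**kwargs)
        self.encoder = Encoder(output_vocab_size, embedding_dim=50, enc_units=64, name='encoder')
        self.decoder = Decoder(output_vocab_size, embedding_dim=50, dec_units=64, name='decoder')
        self.dense = tf.keras.layers.Dense(output_vocab_size, activation='softmax')

    def call(self, data):
        # model.fit is given [encoder_inputs, decoder_inputs]
        encoder_inputs, decoder_inputs = data
        enc_output, enc_h, enc_c = self.encoder(encoder_inputs)
        dec_output, _, _ = self.decoder(decoder_inputs, [enc_h, enc_c])
        return self.dense(dec_output)              # (batch, dec_len, vocab_size)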
Model compiling and Training
model = MyModel(encoder_inputs_length=10, decoder_inputs_length=10, output_vocab_size=500)

ENCODER_SEQ_LEN = 30
DECODER_SEQ_LEN = 20

input = np.random.randint(0, 499, size=(2000, ENCODER_SEQ_LEN))
output = np.random.randint(0, 499, size=(2000, DECODER_SEQ_LEN))
target = tf.keras.utils.to_categorical(output, 500)

# loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
optimizer = tf.keras.optimizers.Adam()
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
model.fit([input, output], output, steps_per_epoch=1)
"""
or you can try this
model.compile(optimizer=optimizer, loss='categorical_crossentropy')
model.fit([input, output], target, steps_per_epoch=1)
"""
model.summary()
Model output/verbose
-------------------- ENCODER --------------------
ENCODER ==> INPUT SQUENCES SHAPE : (?, 30)
ENCODER ==> AFTER EMBEDDING THE INPUT SHAPE : (?, 30, 50)
---------------------------
ENCODER ==> OUTPUT SHAPE (?, 30, 64)
ENCODER ==> HIDDEN STATE SHAPE (?, 64)
ENCODER ==> CELL STATE SHAPE (?, 64)
-------------------- DECODER --------------------
DECODER ==> INPUT SQUENCES SHAPE : (?, 20)
WE ARE INITIALIZING DECODER WITH ENCODER STATES : (?, 64) (?, 64)
---------------------------
FINAL OUTPUT SHAPE (?, 20, 500)
---------------------------
1/1 [==============================] - 4s 4s/step - loss: 6.2145
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
encoder (Encoder)            multiple                  54440
_________________________________________________________________
decoder (Decoder)            multiple                  54440
_________________________________________________________________
dense (Dense)                multiple                  32500
=================================================================
Total params: 141,380
Trainable params: 141,380
Non-trainable params: 0
_________________________________________________________________
The Seq2Seq framework relies on the encoder-decoder paradigm. The encoder encodes the input sequence, while the decoder produces the target sequence.
Encoder
Our input sequence is “how are you”. Each word from the input sequence is associated to a vector \( w \in \mathbb{R}^d \) (via a lookup table). In our case, we have 3 words, so our input will be transformed into \( [w_0, w_1, w_2] \in \mathbb{R}^{d \times 3} \). Then, we simply run an LSTM over this sequence of vectors and store the last hidden state output by the LSTM: this will be our encoder representation \( e \). Let’s write the hidden states \( [e_0, e_1, e_2] \) (and thus \( e = e_2 \)).
Decoder
Now that we have a vector \( e \) that captures the meaning of the input sequence, we’ll use it to generate the target sequence word by word. We feed another LSTM cell with \( e \) as its initial hidden state and a special start-of-sentence vector \( w_{sos} \) as input. The LSTM computes the next hidden state \( h_0 \in \mathbb{R}^h \). Then, we apply some function \( g : \mathbb{R}^h \mapsto \mathbb{R}^V \) so that \( s_0 := g(h_0) \in \mathbb{R}^V \) is a vector of the same size as the vocabulary.
Then, we apply a softmax to \( s_0 \) to normalize it into a vector of probabilities \( p_0 \in \mathbb{R}^V \). Each entry of \( p_0 \) measures how likely each word in the vocabulary is. Let’s say that the word “comment” has the highest probability (and thus \( i_0 = \operatorname{argmax}(p_0) \) corresponds to the index of “comment”). We get the corresponding vector \( w_{i_0} = w_{comment} \) and repeat the procedure: the LSTM will take \( h_0 \) as hidden state and \( w_{comment} \) as input and will output a probability vector \( p_1 \) over the second word, etc.
The decoding stops when the predicted word is a special end of sentence token.
Intuitively, the hidden vector represents the “amount of meaning” that has not been decoded yet.
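A minimal sketch of this greedy decoding loop, assuming we already have an `embedding` layer, an `LSTMCell`, a projection layer `output_layer` playing the role of \( g \), and placeholder `sos_id`/`eos_id` token ids (all of these names are assumptions for illustration, not part of the code above):

import tensorflow as tf

def greedy_decode(encoder_state, embedding, lstm_cell, output_layer, sos_id, eos_id, max_len=20):
    state = encoder_state                                  # [h, c] from the encoder, i.e. e
    word_id = sos_id                                       # start-of-sentence token
    decoded = []
    for _ in range(max_len):
        w = embedding(tf.constant([[word_id]]))[:, 0, :]   # w_{i_t}, shape (1, d)
        h, state = lstm_cell(w, states=state)              # next hidden state h_t
        probs = tf.nn.softmax(output_layer(h))             # p_t = softmax(g(h_t))
        word_id = int(tf.argmax(probs, axis=-1)[0])        # i_t = argmax(p_t)
        if word_id == eos_id:                              # stop at the end-of-sentence token
            break
        decoded.append(word_id)
    return decoded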
The above method aims at modelling the distribution of the next word conditioned on the beginning of the sentence, \( \mathbb{P}\left[ y_{t+1} \mid y_1, \dots, y_t, x_1, \dots, x_n \right] \) (where the \( x_i \) are the input words and the \( y_t \) the generated words), by writing it as \( \mathbb{P}\left[ y_{t+1} \mid y_t, h_t, e \right] \).
Note: in simple vanilla seq2seq models, we pass the last time step's hidden and cell states to the decoder. Instead of that, we can do average pooling or max pooling over all the encoder's hidden states and then pass the result as the input to the decoder.
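A minimal sketch of that pooling variant, assuming the encoder returns the full sequence of hidden states `enc_output` of shape (batch, enc_len, units); the function and variable names are illustrative:

import tensorflow as tf

def pooled_decoder_init(enc_output, mode="avg"):
    # reduce over the time axis instead of taking only the last time step
    if mode == "avg":
        pooled = tf.reduce_mean(enc_output, axis=1)    # (batch, units)
    else:
        pooled = tf.reduce_max(enc_output, axis=1)     # (batch, units)
    # use the pooled vector as both the initial hidden and cell state of the decoder LSTM
    return [pooled, pooled]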
The code in section 3 is an implementation of the above concept.
Output
============================== Inference ==============================
ENCODER ==> INPUT SQUENCES SHAPE : (1, 30)
ENCODER ==> AFTER EMBEDDING THE INPUT SHAPE : (1, 30, 50)
-------------------- started predition --------------------
at time step 0 the word is 0
at time step 0 the word is [[55]]
at time step 0 the word is [[55]]
at time step 0 the word is [[55]]
at time step 0 the word is [[9]]
at time step 0 the word is [[50]]
at time step 0 the word is [[18]]
at time step 0 the word is [[23]]
at time step 0 the word is [[56]]
at time step 0 the word is [[56]]
at time step 0 the word is [[56]]
at time step 0 the word is [[56]]
at time step 0 the word is [[56]]
at time step 0 the word is [[56]]
at time step 0 the word is [[56]]
at time step 0 the word is [[25]]
at time step 0 the word is [[63]]
at time step 0 the word is [[25]]
at time step 0 the word is [[12]]
at time step 0 the word is [[12]]
at time step 0 the word is [[3]]