
5.3.2 Multivariate LSTM model

Long short-term memory (LSTM) is an extended version of recurrent neural networks (RNNs). It can learn long-term dependencies and avoids the vanishing gradient problem often seen in traditional RNNs. We use multivariate LSTM models to predict the resource usage of four MapReduce applications. Generally, a multivariate LSTM modeling process is divided into two phases: one learns the optimal hyperparameters, the other fits the model and makes a prediction. The architecture of the LSTM unit (see Figure 5.2a) shows why LSTM is suited to learning long-term dependencies. The LSTM unit consists of a memory cell ($ce_i$), an input gate ($in_i$), an output gate ($ou_i$) and a forget gate ($fo_i$) [37].

Each gate is essentially a sigmoid neural-network layer. Through these gates, the LSTM unit is able to memorize dependent characteristics and avoid the vanishing gradient problem. The LSTM transition equations (5.11)-(5.17) are as follows:

Forget gate: $fo_i = \sigma(W_{fo} \cdot [h_{i-1}, x_i] + b_{fo})$ (5.11)
Input gate: $in_i = \sigma(W_{in} \cdot [h_{i-1}, x_i] + b_{in})$ (5.12)
Output gate: $ou_i = \sigma(W_{ou} \cdot [h_{i-1}, x_i] + b_{ou})$ (5.13)
Candidate value: $ca_i = \tanh(W_{ca} \cdot [h_{i-1}, x_i] + b_{ca})$ (5.14)
Memory cell: $ce_i = fo_i \otimes ce_{i-1} + in_i \otimes ca_i$ (5.15)
Hidden layer vector: $h_i = ou_i \otimes \tanh(ce_i)$ (5.16)
Output vector: $y_{i+1} = \sigma(W_y \otimes h_i + b_y)$ (5.17)

where $x_i$ is the input feature vector at time instant $i\Delta t$, $\sigma$ represents the activation function (the logistic sigmoid function), $W$ and $b$ denote the connection weights and biases, and $\otimes$ indicates element-wise multiplication. Overall, the forget gate controls how much of the previous memory cell is forgotten, the input gate controls the extent to which a unit is updated, and the output gate decides which part of the internal memory state is output.
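To make the gate interactions concrete, the following is a minimal NumPy sketch of one forward step through equations (5.11)-(5.17). The weight layout, the 8-unit hidden size, and the interpretation of the four input features as one observation of the usage vector in Figure 5.2b are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_i, h_prev, ce_prev, W, b):
    """One forward step of equations (5.11)-(5.17).
    W and b hold one weight matrix / bias vector per gate,
    keyed 'fo', 'in', 'ou', 'ca', 'y' (an assumed layout)."""
    z = np.concatenate([h_prev, x_i])           # [h_{i-1}, x_i]
    fo_i = sigmoid(W['fo'] @ z + b['fo'])       # forget gate      (5.11)
    in_i = sigmoid(W['in'] @ z + b['in'])       # input gate       (5.12)
    ou_i = sigmoid(W['ou'] @ z + b['ou'])       # output gate      (5.13)
    ca_i = np.tanh(W['ca'] @ z + b['ca'])       # candidate value  (5.14)
    ce_i = fo_i * ce_prev + in_i * ca_i         # memory cell      (5.15)
    h_i = ou_i * np.tanh(ce_i)                  # hidden vector    (5.16)
    y_next = sigmoid(W['y'] @ h_i + b['y'])     # output vector    (5.17)
    return h_i, ce_i, y_next

# Toy dimensions: 4 usage features per observation, 8 hidden units (assumptions)
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
W = {g: rng.standard_normal((n_hid, n_hid + n_in)) * 0.1 for g in ('fo', 'in', 'ou', 'ca')}
W['y'] = rng.standard_normal((n_in, n_hid)) * 0.1
b = {g: np.zeros(n_hid) for g in ('fo', 'in', 'ou', 'ca')}
b['y'] = np.zeros(n_in)

h, ce, y = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```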

A one-hidden-layer model is illustrated in Figure 5.2b.

[Figure 5.2a: diagram of an LSTM unit with inputs $x_i$, $h_{i-1}$, $ce_{i-1}$, gates $fo_i$, $in_i$, $ou_i$, candidate value $ca_i$, and outputs $h_i$, $ce_i$.]

(a) LSTM unit

[Figure 5.2b: the single hidden LSTM($k$) layer unfolded over the input sequence $x_i, x_{i+1}, \ldots, x_N$, producing outputs $y_{i+1}, y_{i+2}, \ldots, y_{N+1}$ through hidden states $h_i, h_{i+1}, \ldots, h_N$; at each time step the input layer feeds LSTM units 1 to $k$, which feed the output layer.]

Note: $x_i = [c_{i-ts+1}, s_{i-ts+1}, r_{i-ts+1}, w_{i-ts+1}, \ldots, c_i, s_i, r_i, w_i]$ is the input feature vector, $ts$ is the time step, $N$ is the number of observations, and $k$ is the number of LSTM units, also called the neuron number.

(b) Unfolded single-hidden-layer LSTM structure

Figure 5.2. Illustration of the long short-term memory network model

In Figure 5.2, $h_i$ represents the hidden layer vector and $y_{i+1}$ indicates the output at time instant $i+1$. A one-hidden-layer LSTM model consists of one input layer, one hidden LSTM layer, and one output layer (a fully-connected dense layer). For more complex situations, additional hidden layers can be added to the LSTM network.
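As a rough illustration of this structure, the following is a minimal Keras sketch of a one-hidden-layer LSTM with $k$ units, a dropout layer, and a fully-connected dense output. The concrete values of $k$, the time step, and the loss function are placeholders, not the tuned values obtained in Section 5.3.2.1.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

time_step, n_features = 5, 4   # placeholder values, tuned later in Section 5.3.2.1
k = 100                        # neuron (LSTM unit) number, also a placeholder

model = Sequential([
    LSTM(k, input_shape=(time_step, n_features)),  # single hidden LSTM layer
    Dropout(0.25),                                  # dropout between LSTM and dense layer
    Dense(n_features),                              # fully-connected output layer
])
model.compile(optimizer='adam', loss='mse')         # 'adam' follows the recommendation used later
```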

5.3.2.1 Hyperparameters learning algorithm

The selection of hyperparameters for a neural network architecture plays a vital role in model performance. State-of-the-art performance usually depends on whether the selected parameters are optimal; a careless choice leads to poor predictions. Therefore, selecting optimal hyperparameters is an essential step in building an LSTM model.

However, which hyperparameters should be evaluated, and how to select them, is often a "black art that requires expert experiences" [93]. Reimers et al. evaluated and compared hyperparameters for five common linguistic sequence tagging tasks [82]. They investigated many parameters from different aspects, such as the optimiser and the mini-batch size, and identified how strongly each parameter affects model performance. Based on their recommendations, we choose six important parameters and determine a valid tuning range for each. The other parameters of the LSTM networks are set to their recommended optimal values; for example, the optimiser is set to "Adam". Since there is no automatic method to compute these hyperparameters, we tuned them manually. In this chapter, the six hyperparameters are:

Batch size, also called mini-batch size, reflects the periodicity of the data. It determines the number of samples processed before the internal model parameters are updated. An appropriate batch size can speed up the learning process. In general, a training dataset comprises one or more batches. Note that the training and test dataset sizes should be exact multiples of the batch size, so the last few records are truncated. Furthermore, at the end of each batch, the error is calculated by comparing the predicted values to the expected output variables.

Epoch size, also called epoch number. One epoch is one complete pass over the training dataset, which may comprise one or more batches, while the epoch number is a key hyperparameter of LSTM that represents how many times the learning algorithm works through the entire training dataset. A larger epoch number is more likely to achieve a sufficiently small error. A proper epoch size therefore contributes effectively to the model optimisation process.

Neuron number, also called LSTM unit number, represents the number of neurons in the hidden layer of the LSTM and reflects the learning capacity of the network. Generally, a larger neuron number can learn more structure from the problem but increases training time. It is worth noting, however, that strengthening the learning capacity always comes with the risk of overfitting.

Time step indicates the number of lagged input features. Using observations from a larger number of previous time steps as input features may improve the learning or predictive capability of the model.

Hidden layer number is associated with the complexity of the problem being solved. A more complex problem needs more hidden layers, whereas adding extra layers typically does not help with a simple problem. The right number of hidden layers helps the network find reasonably complex features.


Dropout rate. Dropout is a popular method to deal with overfitting in neural networks, and it is usually inserted between the hidden layer and the dense layer. However, a fixed dropout rate is not suitable for every prediction scenario, so learning an appropriate dropout rate is essential for the optimisation of the LSTM.
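For orientation, the tuning ranges swept by the sub-procedures described below (Figures 5.4 and 5.5) can be written down compactly. The dictionary is only a summary of those ranges; the dropout candidates are an assumption, since the sub-procedure of Figure 5.5 is not reproduced here.

```python
# Tuning ranges implied by the learning sub-procedures in Figures 5.4-5.5
# (the dropout candidates are an assumption; Figure 5.5 is not shown here).
search_space = {
    'epoch_size':    list(range(100, 901, 200)),   # 100, 300, 500, 700, 900
    'batch_size':    [8, 16, 32, 64],              # doubled from 8 while < 64
    'neuron_number': list(range(10, 131, 30)),     # 10, 40, 70, 100, 130
    'time_step':     list(range(1, 8)),            # 1 .. 7
    'hidden_layers': [1, 2],                       # one- vs two-hidden-layer LSTM
    'dropout_rate':  [0.0, 0.25, 0.5],             # assumed candidate values
}
```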

The first five hyperparameters are obtained with a dedicated learning algorithm. The algorithm is divided into five sub-procedures, shown in Figures 5.4 and 5.5: the epoch size (5.4a), batch size (5.4b), neuron number (5.4c), time step (5.4d), and dropout rate (5.5) learning sub-procedures, each yielding one optimal parameter. Every sub-procedure includes a common module (Algorithm 1) that calculates the mean validation RMSE for each candidate model. Algorithm 1 is shown in Figure 5.3.

[Figure 5.3 flowchart: read the time series data $(X_i, S_i, R_i, W_i)$; split it into training (the first 80% of the samples), validation (the next 10%), and test (the remaining 10%) sets; rescale the data to values between -1 and 1; use the observation at the previous time step as input to forecast the usage parameter at the current time step; train the model on the training data, then use the fitted model to predict on the validation set; transform the predictions back to the original scale; calculate and store the validation RMSE; once the experimental scenario has run 5 times, calculate and store the mean of the 5 stored validation RMSEs.]

Figure 5.3. Common module for calculating the mean of validation RMSE for each forecasting model
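The following is a minimal Python sketch of this common module, under stated assumptions: `build_and_fit` is a hypothetical caller-supplied function (not the thesis API) that trains a model on the scaled training set and returns validation predictions on the scaled validation inputs, and `series` is an (N, 4) array of usage observations.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

def common_module(series, build_and_fit, n_runs=5):
    """Sketch of Algorithm 1: mean validation RMSE over repeated runs."""
    n = len(series)
    train = series[: int(0.8 * n)]                 # first 80% for training
    valid = series[int(0.8 * n): int(0.9 * n)]     # next 10% for validation
    # the remaining 10% (test set) is held back and not used here

    scaler = MinMaxScaler(feature_range=(-1, 1)).fit(train)   # rescale to [-1, 1]
    train_s, valid_s = scaler.transform(train), scaler.transform(valid)

    rmses = []
    for _ in range(n_runs):                        # repeat the scenario 5 times
        pred_s = build_and_fit(train_s, valid_s)   # previous-step input -> current-step forecast
        pred = scaler.inverse_transform(pred_s)    # back to the original scale
        rmses.append(np.sqrt(mean_squared_error(valid, pred)))
    return float(np.mean(rmses))                   # mean of the stored validation RMSEs
```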

However, the hidden layer number cannot be readily determined in this way. We therefore test both a one-hidden-layer and a two-hidden-layer LSTM model, and the hyperparameter learning process keeps whichever model yields the smallest mean validation RMSE (root mean squared error) [14]. Because of its structure, a one-hidden-layer LSTM learns the five parameters above, whereas a two-hidden-layer LSTM requires tuning seven hyperparameters: one extra neuron number and one extra dropout rate learning sub-procedure must be added to the search for the two-hidden-layer LSTM.
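For the two-hidden-layer variant, a second LSTM layer is stacked before the dense output. The Keras layout below is an assumption, with placeholder unit counts and one dropout layer per hidden layer (consistent with the extra dropout rate being tuned, but not confirmed by the thesis).

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

model_2h = Sequential([
    # the first hidden LSTM layer must return the full sequence for the next LSTM layer
    LSTM(100, return_sequences=True, input_shape=(5, 4)),
    Dropout(0.25),
    LSTM(100),            # second hidden layer: its units and dropout are tuned separately
    Dropout(0.25),
    Dense(4),             # fully-connected output layer
])
model_2h.compile(optimizer='adam', loss='mse')
```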

Note that the root mean squared error (RMSE) is used to evaluate prediction accuracy because it penalises large errors. A lower RMSE represents higher forecasting accuracy when comparing models. The RMSE of a model over $n$ predictions is given in equation (5.18) below.

[Figure 5.4a flowchart: start; initialise epoch = 100 with batch size = 32, time step = 5, neuron number = 100, dropout rate = 0.25; call the common module (Algorithm 1); while the current epoch < 900, increase the epoch by 200 and repeat; otherwise select the optimal epoch size with the lowest validation RMSE and end.]

(a) Epoch size learning algorithm

[Figure 5.4b flowchart: start; initialise batch size = 8 with neuron number = 100, time step = 5, epoch = the obtained optimal epoch size, dropout rate = 0.25; call the common module (Algorithm 1); while the current batch size < 64, double the batch size and repeat; otherwise select the optimal batch size with the lowest validation RMSE and end.]

(b) Batch size learning algorithm

[Figure 5.4c flowchart: start; initialise neuron number = 10 with time step = 5, epoch size = the optimal epoch size, batch size = the optimal batch size, dropout rate = 0.25; call the common module (Algorithm 1); while the current neuron number < 130, increase the neuron number by 30 and repeat; otherwise select the optimal neuron number with the lowest validation RMSE and end.]

(c) Neuron number learning algorithm

[Figure 5.4d flowchart: start; initialise time step = 1 with epoch size = the optimal epoch size, batch size = the optimal batch size, neuron number = the optimal neuron number, dropout rate = 0.25; call the common module (Algorithm 1); while the current time step < 7, increase the time step by 1 and repeat; otherwise select the optimal time step with the lowest validation RMSE and end.]

(d) Time step learning algorithm

Figure 5.4. Hyperparameters learning algorithm
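Taken together, the sub-procedures of Figure 5.4 form a greedy, one-parameter-at-a-time sweep. The sketch below compresses them into a single loop; `evaluate` is a hypothetical stand-in for the common module of Figure 5.3 (an assumption, not the thesis interface), and the dropout rate sub-procedure of Figure 5.5 is left out.

```python
def tune_sequentially(evaluate):
    """Greedy sweep mirroring Figure 5.4 (a)-(d); `evaluate(**params)` is assumed
    to return the mean validation RMSE from the common module (Algorithm 1)."""
    params = {'epoch': 100, 'batch_size': 32, 'time_step': 5,
              'neurons': 100, 'dropout': 0.25}          # initial settings from Figure 5.4a

    sweeps = [
        ('epoch',      list(range(100, 901, 200))),     # (a) 100, 300, ..., 900
        ('batch_size', [8, 16, 32, 64]),                # (b) doubled from 8
        ('neurons',    list(range(10, 131, 30))),       # (c) 10, 40, ..., 130
        ('time_step',  list(range(1, 8))),              # (d) 1 .. 7
    ]
    for name, candidates in sweeps:
        scores = {v: evaluate(**{**params, name: v}) for v in candidates}
        params[name] = min(scores, key=scores.get)      # keep the value with lowest RMSE
    return params                                       # dropout is tuned afterwards (Figure 5.5)
```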

The RMSE is the square root of the mean of the squared errors:

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}{n}} \qquad (5.18)$$

where $\hat{y}_i$ is the predicted value and $y_i$ is the observed value.
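Equation (5.18) in code, as a quick check of the metric used throughout the tuning procedure:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error of equation (5.18)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

print(rmse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # ~0.1414
```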