Evaluation and discussion - BudapestUniversityofTechnologyandEconomics Prof.VanTienDo Superviso


4.6 Evaluation and discussion

Figure 4.4. Statistical metrics distribution of read rate model of Terasort application


4.6.1 Estimated coefficients and statistical metrics

The estimated coefficients of the regression models reflect the pattern of dependency between the response and the predictors, while the statistical metrics are used to evaluate the quality of fit.

Therefore, the distributions of all estimated coefficients and statistical metrics are shown in Figures 4.5 and 4.6. In these figures, the height of each bar represents the value of an estimated coefficient or statistical metric, and the bar pattern distinguishes the corresponding usage parameter of the regression model.

The top-left panel of Figure 4.5 displays the distribution of the estimated coefficients of the CPU usage models. The top-right, bottom-left and bottom-right panels refer to the memory usage, read rate and write rate models, respectively.

Figure 4.5 shows, for all MapReduce applications, a positive dependency of varying strength between each resource usage parameter and its previous value. That is, all current resource usage parameters depend positively on their previous values to some degree. Beyond these common dependencies, some dependencies are specific to individual applications.

In the top-left panel of Figure 4.5, the CPU usage of the Pi application shows the strongest positive dependency on lagged CPU usage, the Teragen application shows the weakest positive dependency, and the others exhibit a moderate positive dependency. Besides the dependency between CPU usage and its lagged variable, the Wordcount and Wordmean applications exhibit a moderate positive dependency between CPU usage and read rate. Meanwhile, the Teragen application shows a weak negative relationship between CPU usage and memory usage.

The top-right panel of Figure 4.5 exhibits the dependencies between memory usage and the other parameters. Except for the Pi application, all applications show an extremely strong positive dependency of memory usage on the lagged memory usage. The memory usage of the Pi application has a weak positive dependency on CPU usage in addition to a moderate dependency on the lagged memory usage. The memory usage of the Grep application exhibits an extremely weak dependency on the lagged CPU usage.

Figure 4.5. Estimated coefficients distribution of regression models

The dependencies between read rate and the other usage parameters are displayed in the bottom-left panel of Figure 4.5. The dependency between the read rate and the previous read rate indicates that, for each application, the read rate most likely depends on the lagged read rate. For the Wordcount, Wordmean and Wordmedian applications, the read rate depends negatively on the write rate. In other words, the read rate decreases significantly as the write rate increases. The read rate of the Terasort application exhibits a moderate negative relationship with the write rate and a moderate positive relationship with the lagged write rate. This implies that the read rate is sensitive to variation in the write rate.

The bottom-right panel of Figure 4.5 exhibits the relationship between write rate and the other usage parameters for each application. The Terasort and Teragen applications have the highest estimated coefficients on the lagged write rate, most likely due to their frequent write operations. The remaining applications exhibit only a weak relationship between write rate and lagged write rate. The write rate of the Pi application also shows a weak positive relationship with lagged CPU usage. The write rate of the Terasort application exhibits a strong negative relationship with the read rate and a positive relationship with the lagged read rate. This means that the write rate is related to the change between the read rate and the lagged read rate. Overall, each MapReduce application has different relationships among its resource usage parameters.
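The dependencies discussed above are read off the estimated coefficients of the regression models. As a rough illustration only, the following sketch fits one such model by ordinary least squares, assuming a linear form in which the current value of a response depends on its own lagged value and on the current and lagged values of one other usage parameter. The function names, the exact set of predictors and the synthetic series are assumptions made for this example; they are not the models or the data of this work.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_lagged_model(response, other):
    """Least-squares fit of the ASSUMED model
    y[t] = b0 + b1*y[t-1] + b2*x[t] + b3*x[t-1].
    Returns [b0, b1, b2, b3]."""
    rows = [[1.0, response[t - 1], other[t], other[t - 1]]
            for t in range(1, len(response))]
    y = response[1:]
    k = len(rows[0])
    # Normal equations: (X^T X) b = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yt for r, yt in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic series (illustrative values only, not measured data).
other_usage = [float((7 * t) % 11) for t in range(30)]
response_usage = [1.0]
for t in range(1, 30):
    response_usage.append(0.5 + 0.6 * response_usage[t - 1]
                          + 0.3 * other_usage[t] - 0.2 * other_usage[t - 1])

coefficients = fit_lagged_model(response_usage, other_usage)
print(coefficients)
```

In this sketch, a positive coefficient on the lagged response corresponds to the positive dependency on the previous value described above, and the sign of the coefficient on the other parameter corresponds to a positive or negative relationship with it.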

Turning to the statistical metrics of the regression models, their distribution is shown in Figure 4.6.

Accordingly, the left and right panels of Figure 4.6 show the residual standard error (RSE) distribution and the R² distribution of each application. A taller R² bar and a shorter RSE bar both represent better fit quality. An R² of almost 1 and a small RSE show that the regression models with memory usage as the response have the best fit quality. The overall higher RSE and lower R² of the regression models with CPU usage as the response indicate a worse fit. The regression models with read rate as the response show a moderate fit quality. For the regression models with write rate as the response, the Terasort and Teragen applications exhibit the best fit quality, while the others show a worse fit. The results show that the regression models for the usage parameters of intensively used resources exhibit good fit quality.

Figure 4.6. RSE and R² distribution of regression models
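For reference, the two metrics compared above can be computed from the observed and fitted values of any fitted model. The sketch below uses the standard definitions, RSE = sqrt(RSS / (n - p - 1)) and R² = 1 - RSS / TSS, where p is the number of predictors excluding the intercept. The function name and the sample values are hypothetical and serve only to illustrate the calculation.

```python
import math


def fit_quality(observed, fitted, n_predictors):
    """Return (RSE, R^2) for a fitted regression model with an intercept.

    n_predictors counts the predictors excluding the intercept.
    """
    n = len(observed)
    dof = n - n_predictors - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("not enough samples for the number of predictors")
    mean_obs = sum(observed) / n
    rss = sum((o - f) ** 2 for o, f in zip(observed, fitted))  # residual sum of squares
    tss = sum((o - mean_obs) ** 2 for o in observed)           # total sum of squares
    rse = math.sqrt(rss / dof)
    r_squared = 1.0 - rss / tss
    return rse, r_squared


# Hypothetical observed and fitted values, for illustration only.
rse, r_squared = fit_quality([1.0, 2.0, 3.0, 4.0, 5.0],
                             [1.1, 1.9, 3.2, 3.8, 5.0],
                             n_predictors=1)
print(rse, r_squared)
```

A small RSE together with an R² close to 1 corresponds to the short RSE bars and tall R² bars that indicate a good fit in Figure 4.6.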

4.6.2 The minimal sampling time for stable modeling

To investigate the minimal number of samples for stable modeling, we examined the minimal sampling time of the estimated coefficients and statistical metrics by testing whether their error changing rates converge to a specific threshold (0.1). The resulting minimal sampling times are shown in Figures 4.7 and 4.8.
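The convergence test described above can be sketched as follows. This section does not restate how the error changing rate is computed, so the sketch assumes it is the relative change of an estimate between two successive sampling times, and it treats the estimate as stable from the first sampling time after which this rate stays below the threshold. The function name, this definition and the sample values are assumptions for illustration, not the exact procedure of this work.

```python
def minimal_sampling_time(times, estimates, threshold=0.1):
    """Return the earliest sampling time from which the relative change of
    the estimate stays below `threshold`, or None if it never stabilizes.

    ASSUMPTION: the changing rate is |e[k] - e[k-1]| / |e[k-1]|.
    """
    if len(times) != len(estimates):
        raise ValueError("times and estimates must have the same length")
    rates = []
    for prev, cur in zip(estimates, estimates[1:]):
        if prev == 0:
            # Avoid division by zero: no change counts as rate 0, any change as unstable.
            rates.append(0.0 if cur == prev else float("inf"))
        else:
            rates.append(abs(cur - prev) / abs(prev))
    stable_index = None
    # Scan backwards so that a late jump invalidates earlier apparent stability.
    for i in range(len(rates) - 1, -1, -1):
        if rates[i] >= threshold:
            break
        stable_index = i + 1  # position of the later estimate in the pair
    return None if stable_index is None else times[stable_index]


# Hypothetical coefficient estimates refitted at growing sampling times.
sampling_times = [10, 20, 30, 40, 50, 60]
coefficient_estimates = [1.0, 1.5, 1.2, 1.25, 1.27, 1.26]
print(minimal_sampling_time(sampling_times, coefficient_estimates))
```

Applying the same function to the series of RSE or R² values obtained at each sampling time would give the corresponding minimal sampling time of a statistical metric under the same assumption.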

The distribution of the minimal sampling time of the estimated coefficients for stable modeling of each application is shown in Figure 4.7.

Figure 4.7 shows that the Terasort application needs the longest sampling time to reach stability, while the Pi application needs the shortest. The remaining applications need similar minimal sampling times. Overall, the memory usage models have the smallest minimal sampling time. The results show that applications intensive in different resources have different requirements on the minimal sampling time, with read/write-intensive applications needing more.

The distribution of the minimal sampling time of the statistical metrics is shown in Figure 4.8. As can be seen, the minimal sampling time of the statistical metrics is smaller than that of the estimated coefficients. Across the resource-intensive classes, a read/write-intensive application such as Terasort needs the longest time to reach stability, while the Pi application needs the shortest.

It can be observed that MapReduce applications belonging to the same resource-intensive class always show similar minimal sampling times, while those in different resource-intensive classes have different requirements for the minimal number of data samples. Read/write-intensive applications need the largest minimal sampling time and CPU-intensive applications require the least. Therefore, the results show that a fixed sampling time is not always reliable for stable modeling.

Figure 4.7. The minimal sampling time of MapReduce applications

Figure 4.8. The minimal sampling time of statistical metrics

In document: A Method to Process Images Data and Prediction Models for some MapReduce Applications, Ph.D. Dissertation by Yang Yuan Li, Supervisor Prof. Van Tien Do, Budapest University of Technology and Economics (pages 66-69)