STREAMLINE - Streamlined Analysis of Data at Rest and Data in Motion

Philipp M. Grulich¹, Tilmann Rabl¹,², Volker Markl¹,², Csaba Sidló³, Andras Benczur³

¹ German Research Center for Artificial Intelligence (DFKI), ² TU Berlin, ³ Hungarian Academy of Sciences (MTA SZTAKI)

¹ firstname.lastname@dfki.de, ² firstname.lastname@tu-berlin.de, ³ lastname@sztaki.hu

ABSTRACT

STREAMLINE aims to improve the overall workflow of big data analytics systems. To this end, it combines research from different areas to reduce the complexity of working with data at rest and data in motion in a unified fashion. As its foundation, STREAMLINE offers a uniform programming model on top of Apache Flink, for which it drives innovations in a wide range of areas, such as interactive visualization of data in motion and advanced window aggregation techniques.

1. PROJECT SUMMARY

The STREAMLINE project aims to improve the workflow and usability of current big data analysis systems. To that end, it provides a uniform system that can analyze both big data at rest and fast data in motion. With this platform, STREAMLINE reduces complexity, costs, and latency.

Traditionally, batch and stream processing were considered two very different types of applications, but in recent years it has become clear that most real-world use cases require systems for both workloads. This forces companies to integrate different specialized systems, which not only leads to complex system architectures and maintenance overhead but also adds high latency to the general data analysis workflow. This is also known as the problem of system and human latency in big data analysis.

Even technologies that are able to combine data in motion and data at rest are currently very complex and difficult to deploy, maintain, and use. In addition, many companies demand far more advanced analyses, which are still hard to implement in current systems.

To reduce this complexity, STREAMLINE combines research and innovations in the areas of distributed systems, data management, and machine learning. STREAMLINE's key goal is to arrive at sustainable innovation through technology transfer to an established and growing open source project. STREAMLINE focuses on innovations in the following four reactive and proactive applications: customer retention, personalized recommendations, targeted advertising, and multilingual Web processing. To bring these innovations into industry, STREAMLINE partners with multiple companies.

© 2017, Copyright is with the authors. Published in Proc. 20th International Conference on Extending Database Technology (EDBT), March 21-24, 2017 - Venice, Italy: ISBN 978-3-89318-073-8, on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

As its system foundation, STREAMLINE relies on the open source data processing system Apache Flink, which handles batch and stream processing on a single pipelined execution engine [1]. On top of this, STREAMLINE offers a single uniform programming model that can be automatically optimized, parallelized, and adapted to the system load, data distribution, and architecture.
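The idea of one programming model for both kinds of data can be illustrated with a minimal sketch. This is plain Python, not the Flink API, and the names `running_counts`, `batch_result`, and `sensor_stream` are hypothetical. It only shows how the same pipeline logic can serve a finite collection and an open-ended source when it is written against a generic iterable.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int]  # (key, value); a made-up record shape for this sketch


def running_counts(events: Iterable[Event]) -> Iterator[Event]:
    """Emit the updated per-key sum after every input event.

    The function only assumes an iterable, so the caller decides whether
    the input is a finite collection or an open-ended stream.
    """
    totals: Dict[str, int] = defaultdict(int)
    for key, value in events:
        totals[key] += value
        yield key, totals[key]


def batch_result(events: Iterable[Event]) -> Dict[str, int]:
    """Data at rest: drain the pipeline and keep only the final sums."""
    final: Dict[str, int] = {}
    for key, total in running_counts(events):
        final[key] = total
    return final


def sensor_stream() -> Iterator[Event]:
    """Data in motion: a generator standing in for an unbounded source."""
    yield ("a", 1)
    yield ("b", 5)
    yield ("a", 2)


stored = [("a", 1), ("b", 5), ("a", 2)]
print(batch_result(stored))                   # final sums over stored data
print(list(running_counts(sensor_stream())))  # incremental updates over a stream
```

In this sketch the batch case is simply the streaming case run to completion. A real engine additionally has to handle parallelism, state, and optimization.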

Research Highlights: Cutty [2] introduces a general aggregation sharing framework for streaming windows, which outperforms previous solutions by orders of magnitude. The technique exploits the fact that window aggregations are among the most redundancy-prone operations in current stream processing. Cutty is also suitable for multi-query aggregation sharing and for non-periodic windows, such as session windows, which can be used for more complex business logic. Based on this technique, STREAMLINE achieves higher throughput and improves the efficiency of its data processing platform.
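To make the redundancy in window aggregation concrete, the following sketch shows the basic idea of sharing partial aggregates across overlapping windows and across queries. It is not Cutty's algorithm: it covers only periodic sliding windows with integer timestamps and a sum aggregate, while Cutty also targets user-defined and non-periodic windows. The function name `shared_window_sums` and its parameters are hypothetical.

```python
from math import gcd
from typing import Dict, List, Tuple

Query = Tuple[int, int]  # (window size, slide), both positive time units


def shared_window_sums(
    events: List[Tuple[int, int]],  # (timestamp, value)
    queries: List[Query],
    horizon: int,                   # aggregate events with timestamp < horizon
) -> Dict[Query, List[Tuple[int, int]]]:
    """Answer several sliding-window sum queries from one set of slices."""
    if not queries:
        return {}

    # Slice length: the largest step that divides every size and slide,
    # so every window boundary of every query falls on a slice boundary.
    step = 0
    for size, slide in queries:
        step = gcd(step, gcd(size, slide))

    # One pass over the input: each event is added to exactly one slice.
    n_slices = horizon // step
    partial = [0] * n_slices
    for ts, value in events:
        if 0 <= ts < n_slices * step:
            partial[ts // step] += value

    # Every window of every query is assembled from the shared slices.
    results: Dict[Query, List[Tuple[int, int]]] = {}
    for size, slide in queries:
        windows = []
        start = 0
        while start + size <= n_slices * step:
            lo, hi = start // step, (start + size) // step
            windows.append((start, sum(partial[lo:hi])))
            start += slide
        results[(size, slide)] = windows
    return results


events = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
print(shared_window_sums(events, [(4, 2), (6, 2)], horizon=6))
```

Each event is touched once, no matter how many windows or queries cover it. Without sharing, an event would be re-aggregated for every overlapping window.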

I² [3], in contrast, focuses on the visualization and interactive aggregation of data in motion, which is a key enabler for fast and efficient real-time data analysis. It contributes an interactive development environment that coordinates the cluster application and includes interactive stream visualization techniques. With this, I² can run advanced and adaptive aggregations directly on the cluster. As one example, we provide an aggregation algorithm for time-series data that reduces the amount of data in a data-rate-independent manner and is proven to be correct and minimal in terms of transferred data. I² is therefore an important part of STREAMLINE, because it enhances the usability and accessibility of its platform.
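What a data-rate-independent reduction means can be sketched as follows. The text above does not spell out the algorithm, so this example is an assumption: it keeps the first, last, minimum, and maximum point per pixel column of a line chart, a known reduction of this kind, and it may differ in detail from the algorithm in [3]. The function name `reduce_for_chart` and its parameters are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (timestamp, value)


def reduce_for_chart(
    points: List[Point], t_start: float, t_end: float, width: int
) -> List[Point]:
    """Keep at most four points (first, last, min, max) per pixel column.

    The output size is bounded by 4 * width, independent of how many
    points arrive in the interval [t_start, t_end).
    """
    columns: Dict[int, Dict[str, Point]] = {}
    span = t_end - t_start
    for t, v in points:
        if not (t_start <= t < t_end):
            continue
        # Map the timestamp to a pixel column; clamp against rounding.
        col = min(width - 1, int((t - t_start) * width / span))
        slot = columns.get(col)
        if slot is None:
            columns[col] = {"first": (t, v), "last": (t, v),
                            "min": (t, v), "max": (t, v)}
            continue
        if t < slot["first"][0]:
            slot["first"] = (t, v)
        if t >= slot["last"][0]:
            slot["last"] = (t, v)
        if v < slot["min"][1]:
            slot["min"] = (t, v)
        if v > slot["max"][1]:
            slot["max"] = (t, v)

    # A point can fill several roles in its column; the set removes repeats.
    reduced = set()
    for slot in columns.values():
        reduced.update(slot.values())
    return sorted(reduced)


sparse = [(i, (i * 7) % 10) for i in range(100)]
dense = [(i / 10, (i * 7) % 10) for i in range(1000)]
print(len(reduce_for_chart(sparse, 0, 100, width=5)))
print(len(reduce_for_chart(dense, 0, 100, width=5)))
```

Both calls return at most 20 points for a chart that is five pixel columns wide, although the second input carries ten times as many points over the same time range.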

2. ACKNOWLEDGEMENTS

This work was supported by the EU Horizon 2020 project Streamline (688191).

3. REFERENCES

[1] Carbone et al. Apache Flink: Stream and batch processing in a single engine. IEEE Data Eng. Bull., 38(4), 2015.

[2] Carbone et al. Cutty: Aggregate sharing for user-defined windows. In CIKM, pages 1201–1210, 2016.

[3] Traub et al. I²: Interactive real-time visualization for streaming data. In EDBT, 2017.
