Another application-specific question raised within the previously mentioned project is how to specify the similarity predicate for similarity joins or grouping, consisting of the similarity or distance measure itself and the threshold. If the chosen threshold has such a major impact on the efficiency of similarity-based operations, as described in Section 5.4, the question is how to specify a threshold that meets requirements regarding efficiency and accuracy. Actually, this adds complexity to the well-studied problem of over- and under-identification, i.e. falsely qualified duplicates. Information about the distance or similarity distribution can be used to decide on a meaningful threshold, as well as to refine user-defined similarity predicates. Distance distributions usually conform to some natural distribution, depending on the specific application, data types, and semantics. Inconsistencies, such as duplicates, cause anomalies in the distribution, e.g. local minima or points of extreme curvature. Figure 5.7 depicts the edit distance distribution for one of the sample sets from Section 5.4 of 4000 tuples having approximately 20% duplicates with an equal distribution of 0, 1, or 2 edit operations to some original tuple, which is apparent in the chart. To actually choose a threshold based on such a distribution, aspects of efficiency as well as quality of the duplicate detection process have to be considered. Hence, setting k = 2 could be a reasonable conclusion drawn from this chart alone.
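The threshold selection described above can be illustrated with a small Python sketch: the pairwise edit distance histogram is computed, and k is set just below the first local minimum of the distribution, under the assumption that distances below that gap stem from duplicates. All helper names here are ours, not those of the system described.

```python
from collections import Counter

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def distance_histogram(tuples):
    # histogram over all pairwise edit distances in the sample
    hist = Counter()
    for i in range(len(tuples)):
        for j in range(i + 1, len(tuples)):
            hist[edit_distance(tuples[i], tuples[j])] += 1
    return hist

def pick_threshold(hist):
    # set k just below the first local minimum of the distribution,
    # so that the small distances attributed to duplicates fall under it
    ds = sorted(hist)
    for a, b, c in zip(ds, ds[1:], ds[2:]):
        if hist[b] < hist[a] and hist[b] <= hist[c]:
            return b - 1
    return ds[0]  # no clear minimum found
```

For a distribution like the one in Figure 5.7, with a dip after distance 2, this heuristic would yield k = 2; in practice the efficiency considerations of Section 5.4 would be weighed against this purely distribution-based choice.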
two parts inside the composite order. As the ordering of the intervals is naturally given by the starting point of each interval, and a different order could cause complications for region growing in the merge step, we focus our sorting heuristic on re-sorting the dimensions. With the same aim of reducing the size of the tree, we sort dimensions with respect to their scattering. A dimension in which the data is distributed over a wide range should not be inserted first into the tree. Instead, a dimension in which the data is clustered in one region can lead to many common paths in the tree when inserted first. A second and more interesting aim for mining is the reduction of runtime, which in our case is limited to DenseCluster with database access and to the merge operations that have to be performed on SC-trees. DenseCluster is already pruned due to in-process removal of redundancy. Merges cannot be pruned, as they are necessary to ensure the completeness of the eDSUC approach. Yet many merges can be avoided with a different processing order. Mining scattered dimensions with noisy data first forces many merges on big SC-trees, as no other dimensions have been restricted yet. These merges probably do not lead to subspace clusters at all. By mining the scattered dimensions last, eDSUC needs to perform merges only on very small SC-trees, which can be beneficial for the overall runtime.
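The scattering heuristic can be sketched as follows; the text does not fix a particular spread measure, so per-dimension variance is assumed here as one plausible instantiation:

```python
def dimension_order(data):
    """Order dimensions for tree insertion: least scattered first.

    `data` is a list of equal-length numeric vectors. Scattering is
    estimated by the per-dimension variance (an assumption; any other
    spread measure, e.g. the value range, fits the same heuristic).
    Clustered dimensions come first and produce common tree paths;
    scattered, noisy dimensions are mined last.
    """
    n, dims = len(data), len(data[0])
    scored = []
    for d in range(dims):
        col = [row[d] for row in data]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        scored.append((var, d))
    return [d for _, d in sorted(scored)]
```

Inserting dimensions in this order keeps the SC-trees small, so the unavoidable merges operate on small trees.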
robust parsing, chunk parsing, similarity-based learning
Current research on natural language parsing tends to gravitate toward one of two extremes: robust, partial parsing with the goal of broad data coverage versus more traditional parsers that aim at complete analysis for a narrowly defined set of data. Chunk parsing [1, 2] offers a particularly promising and by now widely used example of the former kind. The main insight that underlies the chunk parsing strategy is to isolate the (finite-state) analysis of non-recursive, syntactic structure, i.e. chunks, from larger, recursive structures. This results in a highly efficient parsing architecture that is realized as a cascade of finite-state transducers and that pur-
In 2008, the German Aerospace Center (DLR) started to prove that 1090ES ADS-B signals broadcast by aircraft can be received on board low earth orbiting (LEO) satellites. This was validated in 2013 by the world's first in-orbit demonstration of a space-based ADS-B system, hosted on the ESA satellite PROBA-V. ADS-B uses two data links: 1090 Extended Squitter (ES), which operates at 1090 MHz, and the Universal Access Transceiver (UAT), which operates at 978 MHz. The latter is used within the US NAS only. The limitations of the different ADS-B systems are to be analyzed against the requirements for international operations. They influence the performance requirements for satellite-based ADS-B to allow minimized separation in NRA, and they affect prevailing processes and flight safety standards for the integration of commercial space flight operations into Air Traffic Management (ATM). Further, integration of the data into the information exchange concept of System Wide Information Management (SWIM) is possible. Using a SWIM-based service, spacecraft and spaceplanes can be integrated safely into the NAS/SESAR and into the worldwide system. Applications for spacecraft tracking close to low earth orbit (LEO) during launch and reentry operations will also become possible.
Another problem in many modern applications, among others in medical imaging, is efficient similarity search in uncertain data. At present, adequate therapy planning for newly detected brain tumors, assumedly of glial origin, needs invasive biopsy, due to the fact that prognosis and treatment both vary strongly for benign, low-grade, and high-grade tumors. To date, differentiation of tumor grades is mainly based on the expertise of neuroradiologists examining contrast-enhanced Magnetic Resonance Images (MRI). To assist neuroradiologist experts in differentiating between tumors of different malignancy, we proposed a novel, efficient similarity search technique for uncertain data. The feature vector of an object is thereby not exactly known but is rather defined by a Probability Density Function (PDF), such as a Gaussian Mixture Model (GMM). Previous work is limited to axis-parallel Gaussian distributions; hence, correlations between different features are not considered in these similarity searches. In this work, a novel, efficient similarity search technique for general GMMs without an independence assumption is presented. The actual components of a GMM are approximated in a conservative but tight way. The conservativity of the approach leads to a
Even though tactile and optical sensing technologies are widely used for data acquisition in dimensional metrology, it has been shown that each technique has its own characteristics and limitations, which lend it to particular applications. Due to the different measuring techniques and their physical working principles, different interactions between the workpiece and the sensor occur, and different surfaces are captured (Weckenmann et al. 2000). The reduction of lead time in RE and the increased requirements in terms of flexibility as well as accuracy have resulted in a great deal of research effort aimed at developing and implementing combined systems based on the cooperative integration of heterogeneous sensors such as mechanical probes and optical systems (Chan et al. 2000, Carbone et al. 2001, Bradley et al. 2001, Sladek et al. 2011). However, until now no relevant study has been found on how to efficiently handle integrated measurement data in reverse engineering, in other words, how to use more accurate measurement information to improve the overall measurement accuracy.
– Parallel query execution: The execution of query plans can be parallelized. In particular, two aspects are well suited for parallel execution: retrieving the IDs of relevant records from the similarity indexes and comparing the selected records with the query. First, the similarity indexes can be queried in parallel. This is particularly useful if the calculation of similar attribute values takes a considerable amount of time. Because similarity calculation can take a different amount of time for each attribute, attention must be paid to good load balancing. Once the set of relevant records that are to be compared with the query has been retrieved from the database, the comparisons can be performed in parallel, as all comparisons are independent of each other. The selected record set can be distributed equally among the similarity comparators to achieve optimal load balancing. All records with a similarity above the retrieval threshold can immediately be added to the query result. We have implemented a preliminary version of the latter aspect, as in our use case the comparisons take the largest fraction of the overall runtime. For this purpose, we created a main-memory-based data structure that holds a subset of 10 million records from the Schufa data set. On this data set, a serial version of the query plan execution with about 5000 comparisons takes 88.2 ms. A version with four threads performing the same number of comparisons takes 28.6 ms, which corresponds to 32.4% of the single-threaded runtime and indicates a promising improvement.
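The parallel comparison step can be sketched as follows; the thread pool, the equal chunking, and the similarity function are illustrative assumptions, not the actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_filter(query, candidates, similarity, threshold, workers=4):
    """Compare candidate records with the query in parallel.

    Since all comparisons are independent, the candidate set is split
    into equally sized chunks, one per worker, for load balancing.
    `similarity` is any scoring function; records scoring at or above
    `threshold` enter the result.
    """
    # round-robin split into `workers` equally sized chunks
    chunks = [candidates[i::workers] for i in range(workers)]

    def compare(chunk):
        return [r for r in chunk if similarity(query, r) >= threshold]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        result = []
        for part in pool.map(compare, chunks):
            result.extend(part)
    return result
```

In a real deployment the comparators would run on the main-memory record structure described above; the chunk size per worker controls the load balance.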
In this paper, we propose a new analysis algorithm for FDR data with a different focus than existing works. This work is motivated by the previous work . Unlike that work, this study uses the FDR data from the perspective of improving airline maintenance operations rather than aircraft operations. In the reference , the Euclidean distance was used as a similarity measure for both the continuous and the discrete parameters, regardless of the parameter characteristics. Therefore, the similarity measure of discrete parameters may not be computed correctly in this approach. Also, in , a density-based clustering algorithm called DBSCAN (density-based spatial clustering of applications with noise)  was applied, but the output of this algorithm is sensitive to the design parameters controlling its density criterion. As a remedy, in this paper, time series data in the FDR are classified into three categories according to their parameter characteristics: continuous, discrete, and warning signals. A type of k-nearest neighbour approach is applied to detect FDR data in which unusual flight patterns are recorded.
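One possible reading of the k-nearest-neighbour detection step is sketched below; the feature extraction, the value of k, and the cutoff are placeholder assumptions, not the paper's tuned parameters:

```python
def knn_outliers(vectors, k=3, cutoff=2.0):
    """Flag unusual flights via a k-nearest-neighbour score.

    Each record is a numeric feature vector extracted from one flight;
    the score is the mean Euclidean distance to the k nearest other
    records. Records whose score exceeds `cutoff` (an assumed,
    data-dependent parameter) are flagged as unusual.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    flagged = []
    for i, v in enumerate(vectors):
        dists = sorted(dist(v, w) for j, w in enumerate(vectors) if j != i)
        score = sum(dists[:k]) / k
        if score > cutoff:
            flagged.append(i)
    return flagged
```

Unlike DBSCAN, this score degrades gracefully when density parameters are misjudged: the ranking of flights by score stays informative even if the cutoff is chosen poorly.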
It should be noted that for the researchers themselves, the added value of RDM is indirect, so active integration into research processes is crucial [14]. Figure 1.1 shows the Research Data Life Cycle (RDLC), which is ideally supported throughout by suitable IT systems. Data records pass through this cycle again and again, which is why support during the transitions between the individual phases is particularly important. Moreover, the role of a dedicated "Data Curator", who is familiar with certain techniques, standards, and technologies, is progressively disappearing within the context of RDM; instead, every single researcher is obliged to take on these tasks. This trend indicates that many "IT skills" are increasingly becoming "general education" in the context of digitization. To support the researcher in all phases of the RDLC, the goal of RWTH is to create a basic infrastructure of connected services [9]. To enable research groups to continue using individual components, technology-independent and process-oriented interfaces will be created. These ensure that different systems are bundled to generate added value for the researcher. Several components are used for this purpose: (i) the Persistent Identifier (PID) for the permanent and unique referencing of research data, (ii) external metadata stored with the PID, which enables the use of discipline-specific systems, (iii) a central archive system in which researchers can store their research data, and (iv) the support of the University Library in publishing the results in a suitable repository. As the importance of descriptive and explanatory metadata is growing, an application has been developed that allows the creation of metadata for institution-specific application profiles. During the creation of research data, this metadata information is still easily available, whereas during the further course of the RDLC this implicit knowledge is not passed on and is lost [15].
The application profiles are based on existing metadata standards. Moreover, an automated connection should be possible, so that machine generation of metadata can be integrated into existing work processes.
Our literature survey on Big Data benchmarks found no clear standard benchmarks for evaluating the performance of Big Data systems in the storage and retrieval of embedding data. Hence, we further elaborate on two closely related works in this domain, in particular Wang et al. , who discuss benchmarking array data with DSIMBench, and the work of Taft et al. , which also considers a Big Data benchmark that looks at array data. The relevance of these papers on microarray data to benchmarking for embeddings follows from the definition of microarrays as anchored arrays of short DNA elements used in large-scale gene expression , which makes them similar in structure to embeddings. In this context, Wang et al.  present DSIMBench as a benchmark to efficiently evaluate analytical workflows, highlighting the most optimal setup for the analysis of microarray data in R. The experimental evaluation, which aims to optimize Big microarray data analysis in R using DSIMBench , consists of eight R workflows based on different data distributions and computational solutions, measuring performance characteristics in data loading and parallel computation. Additionally, Taft et al.  present GenBase as a benchmark to evaluate the performance of systems for efficient Big Data management and analytics. The benchmark evaluates five queries, including Predictive Modeling, Covariance, Biclustering, Singular Value Decomposition (SVD), and Statistical Tests, each consisting of a mix of data management, linear algebra, and statistical operation workloads. Additionally, the experimental evaluation of GenBase  is based on a representative use case of microarray genomics data workloads but can be extended to other domains with mixed workloads of data management and complex analytics.
The second part of this thesis deals with two major data mining techniques: clustering and classification. Since privacy preservation is a very important demand of distributed advanced applications, we propose using uncertainty for data obfuscation in order to provide privacy preservation during clustering. Furthermore, a model-based and a density-based clustering method for multi-instance objects are developed. Afterwards, original extensions and enhancements of the density-based clustering algorithms DBSCAN and OPTICS for handling multi-represented objects are introduced. Since several advanced database systems, like biological or multimedia database systems, handle predefined, very large class systems, two novel classification techniques for large class sets that benefit from using multiple representations are defined. The first classification method is based on the idea of a k-nearest-neighbor classifier. It employs a novel density-based technique to reduce training instances and exploits the entropy impurity of the local neighborhood in order to weight a given representation. The second technique addresses hierarchically organized class systems. It uses a novel hierarchical, supervised method for the reduction of large multi-instance objects, e.g. audio or video, and applies support vector machines for efficient hierarchical classification of multi-represented objects. The benefits of this technique for users are demonstrated by a prototype that performs classification of large music collections.
experience when binocular disparity is given, instead of producing it from monocular images. Finally, our fusion is not limited to inference of disparity, but could include other modalities such as observer motion, multiple images or real-time sensor data. This thesis has injected “islands” of intelligence into the image-based rendering pipeline. However, considering recent developments in machine learning, end-to-end trainable pipelines tend to outperform most alternatives. Promising future work could directly apply this line of thought to the algorithms developed in this thesis. Point patterns can be optimized for a specific use case, like Monte Carlo integration in a domain of interest. In the context of our Deep Point Correlation Design, only the agenda would need to change to incorporate task-specific losses. Likewise, Laplacian Kernel Splatting could benefit from an end-to-end approach. While the current system relies on hard-coded mathematics to integrate a sparse PSF representation, a trainable densification scheme would probably achieve higher compression rates and efficiency. That way, our system would become an auto-encoder [Hinton and Salakhutdinov, 2006] for PSFs. Stereo Cue Fusion provides back-propagatable modules that can complement a fully trainable deep system.
5.5 Similarity Join to Support Density-based Clustering

is fulfilled, we have a choice of two options for incrementing the counters of the objects. Option 1 is to first add the counter of x and then the counter of q using the atomic operation atomicInc() (cf. Section 4.3). This forces synchronization of all participating threads. As already mentioned, the atomic operations serve to assure the correctness of the result when various threads simultaneously try to increment the counters of objects. In clustering, we typically have many core objects, which causes a large number of synchronized operations limiting the parallelism. Therefore, we also implemented option 2, which ensures correctness without synchronized operations. As soon as a pair of objects (x, q) satisfies the join condition, we only increment the counter of point q. Point q corresponds to the point of the outer loop for which the separate thread has been created, which means q is exclusively associated with the threadID. Therefore, the cell counter[threadID] can be safely incremented with the ordinary, non-synchronized operation inc(). Since no other point is associated with the same threadID as q, no collision can happen. However, note that unlike option 1, for each point of the outer loop, the inner loop needs to consider all other points; otherwise results are missed. Recall that for the conventional sequential nested loop join (cf. Figure 5.4), it suffices to consider in the inner loop only those points which have not been processed so far. Already processed points can be excluded because, if they are join partners of the current point, this has already been detected. The same holds for option 1. Because of parallelism, we cannot state which objects have already been processed. However, it is still sufficient for each object to search in the inner loop for join partners among those objects which would appear later in the sequential processing order.
This is because all other objects are addressed by different threads. Option 2 requires checking all objects since only one counter is incremented. With sequential processing, option 2 would thus duplicate the workload. However, as our results in Section 5.6 demonstrate, option 2 can pay off under certain conditions, since parallelism is not limited by synchronization.
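Option 2 can be illustrated with ordinary CPU threads standing in for GPU threads; this is a sketch of the idea, not the CUDA implementation described above:

```python
from threading import Thread

def eps_join_counts(points, eps):
    """Option 2 of the parallel epsilon-join, sketched with CPU threads.

    Each thread owns one outer-loop point q and increments only its
    own cell counter[tid], so no atomic operation is needed. The price
    is that the inner loop must scan *all* other points, not only the
    so-far-unprocessed ones as in the sequential nested-loop join.
    """
    n = len(points)
    counter = [0] * n  # counter[tid] is exclusive to thread tid

    def worker(tid):
        q = points[tid]
        for j, x in enumerate(points):
            if j != tid and abs(x - q) <= eps:
                counter[tid] += 1  # plain, non-synchronized increment

    threads = [Thread(target=worker, args=(t,)) for t in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Option 1 would instead increment both counter entries per matching pair under an atomic operation (here, a lock), halving the inner-loop work but serializing the updates.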
Parallel Density-Based Clustering. In [XJK99] a parallel version of DBSCAN has been introduced. The algorithm starts with the complete data set residing on one central server and then distributes the data among the different clients. The underlying data structure is the dR*-tree, a modification of the R*-tree [BKSS90]. The directory of the R*-tree is replicated on all available computers to enable efficient access to the distributed data. This distributed R*-tree is called the dR*-tree and has the following structural differences from a traditional centralized R*-tree: the data pages are distributed on different computers, the indices are replicated on all computers, and the pointer to a data page consists of a computer identifier and a page ID. In order to distribute the data pages onto the different slaves, the centers of the leaf pages are ordered by their Hilbert values. Then each client receives an equal number of data pages having adjacent Hilbert values. The different slaves communicate via message passing and cluster their data separately. Finally, the server has to merge the different clustering results. This approach suffers from several drawbacks. First, the clients have to communicate with each other after the partitioning of the data has taken place. Second, the approach is only applicable to objects represented as feature vectors but not generally to metric objects. Third, the existence of an index structure is presumed. In the remainder of this chapter, we will present a more general approach for parallelizing DBSCAN which overcomes all of these shortcomings.
We were able to determine an effective default combination strategy for aggregating matcher-specific results and selecting match candidates. Average proved to be the aggregation method of choice, as it could best compensate for the shortcomings of individual matchers. Our undirected approach Both supports very good precision and thus usually produced better match results than directional approaches. For match tasks with many m:n matches (e.g., due to shared elements), the most accurate predictions can be achieved by selecting match candidates showing the (approximately) highest similarity exceeding an average threshold (Threshold+MaxDelta). To compute the combined similarity between sets of elements, the pessimistic strategy Average, which takes element similarities into account, performs better than the optimistic strategy Dice, which only considers the ratio of similar elements. The stable behavior of the default combination strategies indicates that they can be used for many match tasks, thereby limiting the tuning effort. Context-dependent matching is required for schemas with shared components, but also represents a big challenge in the case of large schemas. The NoContext strategy yields unacceptable quality for our test schemas and is therefore only feasible for schemas without shared components. For small schemas, AllContext shows a slightly better quality than FilteredContext. Both, however, achieve almost the same quality in the large match tasks. AllContext, when applied to complete schemas or shared fragments, is very expensive. Furthermore, it is sensitive to Delta, especially in large schemas, showing fast-degrading quality with increasing Delta. On the other hand, FilteredContext performs very fast in general. It is robust against the choice of Delta, thereby limiting the tuning effort. Hence, while AllContext is the strategy of choice for small schemas, FilteredContext suits large schemas best.
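The default combination, Average aggregation followed by Threshold+MaxDelta selection, can be sketched as follows; the parameter values and function name are illustrative, not the tuned defaults of the evaluation:

```python
def combine_and_select(sim_by_matcher, threshold=0.5, delta=0.05):
    """Average aggregation plus Threshold+MaxDelta selection.

    `sim_by_matcher` maps each match candidate to the list of
    matcher-specific similarities for it. Averaged scores must exceed
    `threshold`; among the qualifying candidates, only those within
    `delta` of the best score are selected, which admits m:n matches.
    """
    avg = {c: sum(s) / len(s) for c, s in sim_by_matcher.items()}
    qualified = {c: v for c, v in avg.items() if v > threshold}
    if not qualified:
        return []
    best = max(qualified.values())
    return sorted(c for c, v in qualified.items() if v >= best - delta)
```

The sensitivity to Delta noted above shows up directly here: a large `delta` admits many near-best candidates, which degrades precision on large schemas.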
Fragment-based matching, especially at the subschema level, also represents an effective strategy for dealing with large schemas.
The steadily increasing air traffic, and commercial space traffic in particular on transcontinental routes or suborbital operations, requires extending the controlled airspace to those regions not yet covered by ground-based surveillance. An ADS-B system with a strong focus on space-based ADS-B can provide global and continuous air and space surveillance to enhance the operation of spacecraft and spaceplanes in transit through the US National Airspace System (NAS), the Single European Sky (SESAR), and above. Such a system can overcome the prevailing surveillance constraints in non-radar airspace (NRA). The limitations of the different ADS-B systems will be discussed against the requirements for international operations. They influence the performance requirements for satellite-based ADS-B to allow minimized separation in NRA, and they affect prevailing processes and flight safety standards for the integration of commercial space flight operations into Air Traffic Management (ATM). Further, integration of the data into the information exchange concept of System Wide Information Management (SWIM) is proposed. Using a SWIM-based service, spacecraft and spaceplanes can be integrated safely into the NAS/SESAR and into the worldwide system. Applications for spacecraft tracking close to low earth orbit (LEO) during launch and reentry operations will also become possible.
Figure 6. Organizational scheme of data relationships in the process of costing per product
The share of operational costs in the production of agricultural products (plant and animal products and services) obtained by costing is usually calculated per physical unit of measurement of the product (piece, kilogram, ton, liter). In the very common case when agricultural enterprises produce a wide range of products (of plant and animal origin) as well as services, and the products within the same product group require similar expenditure (e.g. cereals and root crops in mass-scale production), the share of operational costs is obtained as a result of process costing, i.e. as an average cost. This calculation is based on cost data, related to the value of incurred costs, and operations data, related to the quantity of products produced and processes realized in the assumed unit of measurement. Costing of operational costs per any cost carrier (product, process, action) requires precise cost identification and identification of the share of individual assets for which costs were incurred. The calculation of all costs per product, process or action is based on the costs of all assets used, as presented in Figure 6.
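The average-cost calculation described above reduces to dividing the sum of all incurred costs of the assets used by the quantity produced in the assumed unit of measurement; a minimal sketch with invented cost items and figures:

```python
def unit_cost(costs_by_asset, quantity):
    """Process costing: average operational cost per unit of product.

    `costs_by_asset` maps each asset used (cost item) to the costs
    incurred for it; `quantity` is the output in the assumed unit of
    measurement (e.g. tons of grain). Names and figures in the example
    below are purely illustrative.
    """
    return sum(costs_by_asset.values()) / quantity
```

For example, costs of 300 for seed, 500 for fuel and 200 for labour over 50 tons of output give an average operational cost of 20 per ton.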
As described in chapter 2, the use of index structures is a standard approach to speed up query processing for similarity search. Since the matching distance together with attributed graphs forms no vector space but a metric space, index structures for metric spaces are needed to speed up query processing with the edge matching distance as the similarity measure. Additionally, the index structures should be fully dynamic in the sense that the insertion of a single graph into the database never requires a complete reorganization of the index structure. In other words, update operations on the index structure should be efficiently possible. One reason for this requirement is that databases are usually updated very often; therefore, index updates also have to be carried out often. But even in application scenarios where updates are performed only periodically, as in data warehouses, dynamic properties of the index structure are important. In those applications, an effective and efficient index structure is necessary for acceptable performance of subsequent data mining steps. Obviously, the update of the index structure must not take longer than the time gained by using the index structure in the data mining step. Otherwise, the positive effects of using an index structure are used up by excessive index update times. Static index structures which have to be rebuilt after each database update usually cannot fulfill this requirement.
During mass events or natural disasters, security authorities and emergency services depend on precise and up-to-date information about the operational area. Since under such extreme conditions the scenario can change completely every couple of minutes, the current traffic situation and the condition of motorways are of special interest. The command centers demand both a detailed and an up-to-date overview of the area. However, stationary sensors like inductive loops, radar sensors or terrestrial traffic cameras have a poor spatial resolution, whereas street maps have a low temporal resolution, i.e. they might simply be out of date. Moreover, in cases of natural disaster, a traffic monitoring system should not depend solely on an intact infrastructure. Therefore, the German Aerospace Center (DLR) is developing an airborne wide-area monitoring system, which is able to transmit high-resolution optical images and traffic data of the affected area in near real time to the ground. This system is mounted aboard an airplane and consists of the DLR 3K camera system (Kurz et al., 2007), a processing rack and a slave tracking antenna. The automatic onboard analysis, including orthorectification, georeferencing and road-traffic data extraction, makes high demands on the onboard hardware/software system. Three cameras produce images at a data rate of up to 54 MB/s. To cope with this large data rate, each camera is dedicated to one computer. After orthorectification, the images are forwarded to the vehicle detection and tracking processor (Rosenbaum et al., 2008), which runs on a fourth computer. These two processes are computationally intensive tasks and should be responsible for the lion's share of the processing time. Nevertheless, all steps in the processing chain perform read and write operations, copy the data into main memory or send it via Ethernet to another host, which adds up to a considerable amount of the processing time.
Therefore, this paper investigates the processing chain and tries to reveal bottlenecks and data congestion in order to optimize the overall processing time. The faster it is, the more data can be sent early enough to the user on the ground.
For computer simulation [RM98], e.g., output data typically needs to be visualized to allow the user a more intuitive and reliable inspection of the results. The respective algorithms are mostly geometry-based. In medical imaging [Ban00], X-ray images and CT scans can either be considered 2D/3D images or be understood as special geometric shape representations of the respective organs, which enables the derivation of physical properties like their volume or weight. 3D geometry has also been widely used in public entertainment, such as animated feature movies (e.g. "Shrek", "Finding Nemo" and "The Incredibles") made with special effects and computer animation technology [Ker03]. In addition, diverse enhanced multimedia and network applications, including virtual museums, avatars, 3D-TV, online encyclopedias, e-commerce, e-learning, and architectural planning, have benefited greatly from the integration of 3D models into digital documents. Having more descriptive power than previous multimedia data types, 3D geometry data enables common users for the first time to truly interact with the displayed contents, beyond only running pre-recorded audio and video footage or following hypertext links.