1Asst.Prof,IT Dept., SRIT, Anantapur, Andhra Pradesh
2Principal, JNTUACEA, Anantapur, Andhra Pradesh
Visit for more related articles at International Journal of Advance Innovations, Thoughts & Ideas
Adaptive join algorithms were created to overcome the drawbacks of traditional join algorithms in emerging data integration or online aggregation environments. The input relations to adaptive joins are continuously retrieved from remote sources. The main objective for designing these algorithms is to i) start producing the first output tuples as soon as possible ii) produce the remaining results at a fast rate. One of the early adaptive join algorithm Multiple Index Nested-loop Reactive join (MINER) is a multi-way join operator used for joining an arbitrary number of input sources. Here MINER was limited to chain joins. In this paper, MINER is extended to support snowflake joins, where each relation may participate in joins with more than two join attributes. It will improve producing result tuples at a significantly higher rate, while making better use of the available memory.
Query processing, Snowflake joins, Streams.
In a distributed environment, statistical information for the available data sources may be minimal, where the availability or load of physical resources is prone to be changed. Consequently, traditional query optimization can lead to poor performance, especially in long running query evaluations, as the query optimizer may not, at compile time, have the necessary statistics, good selectivity estimates, knowledge of the runtime bindings in the query, or knowledge of the available system resources required to produce an optimal query plan (QP). In addition, traditional optimizers cannot predict the future availability of resources. Adaptive query processing (AQP) addresses this, by adapting the query processing to changing environmental conditions at runtime.
Traditional join algorithms [9,10] assume that all input data is available beforehand. This assumption is not suitable in emerging data integration or online aggregation environments; a key performance metric is rapid availability of first results & higher join result rate. With the goals of avoiding the blocking behavior of remote data sources & producing join results as quickly as possible a family of adaptive join algorithms are developed [1,3,4,5,6,7]. Adaptive join algorithms have the ability of non-blocking behavior and producing join results even if one or both sources are blocked.
Adaptive joins are designed to deal with some additional challenges over traditional join algorithms: input relations to them are provided by autonomous data sources through heterogeneous networks. Data is transported through unreliable network environments. The problem is that data arrival rate can’t be controlled. Since the data access over wide-area networks involves a large number of remote data sources, intermediate sites & communication links, all of which are vulnerable to congestion and failures. Adaptive join algorithms overcome situation like initial delay, slow data delivery, or bursty arrival [2,11] which can affect the efficiency of join. Most existing algorithms are limited to a two-way join. Devising an effective multiway adaptive join operator is a challenge in which little progress has been made [1,3]. In this paper, a similar approach of MINER[1] that supports snowflake joins is proposed, where data are held by multiple remote sources.
The rest of the paper is organized as follows: Section 2 illustrates Snowflake MINER architecture and its operations in detail. In Section 3, MINER for Snowflake joins algorithm is analyzed. Section 4 is carried out through experiments and results their conclusions and future work are proposed in Section 5.
A snowflake join is a join where a large table is joined with three or more smaller base tables or joins relations, where these smaller base tables may themselves are joined to three or more smaller base tables or joins relations. The following architecture is proposed to extend the MINER to support snowflake joins.
Fig. 1 shows MINER for snowflake joins architecture. MINER for snowflake joins proceeds in three stages, each of which is performed by a separate thread. The First Stage joins memory resident tuples. Tuples arrive in the input buffer. Tuples are processed in memory using an index. Some statistics might be kept about relations (e.g., cardinalities). When memory is exhausted, some tuples are flushed on disk by using flushing policy [5]. The Second Stage joins tuples that have been flushed to disk due to memory constraints. It is activated when all streams experience delay. The Third Stage is a clean-up stage, which performs any necessary matching to produce results missed by the first two stages. The first and second stages run in an interleaved fashion, the second stage takes over when the first becomes blocked due to a lack of input. These stages are terminated after all input has been received, at which point the third stage is initiated.
In this section algorithm for MINER to support snowflake joins of the following form is proposed
Fig. 2 illustrates each relation may participate in joins with more than two join attributes.
Step1: While relations RA, RB, RC and RD still have tuples do following steps.
Step2: If tuple ti € Ri arrived (i € {RA, RB, RC ,RD}) in input buffer, then move tuples from input buffer to MINER process space.
Step3: Apply join operation for set of matching tuples found using Indexes.
Step4: When Used Memory exceeds Threshold limit then flush those unprocessed tuples that reside in MINER process space for maximum time to disk such that Used Memory must come into below Threshold limit.
Step5: If transmission of RA, RB, RC and RD is blocked more than wait threshold then
5a. Bring tuples that have been flushed to disk due to memory constraints to MINER process space.
5b. Apply join operation for set of matching tuples found using Indexes.
5c. If maximum numbers of input tuples are collected in input buffer size is satisfied then repeat steps 2 to 4.
Once all relations have been received in their entirely repeat steps 2 to 4.
There exist multiple join attributes per relation, hence algorithm maintains for each input relation a separate index on each join attribute. In this algorithm hash-table data structure is utilized as indexes. In Step 2 Algorithm is said to be in First Stage, in which tuples will arrive to input buffer. When a new tuple comes, it is stored and indexed in the Miner process space of its relation, based on the relation’s join attributes. Then this new tuple is checked for matching with all the in-memory tuples belonging to all other relations participating in the join. This process will continue until memory is exhausted, or else algorithm flushes tuples from the relation with the largest number of in-memory tuples to respective disks, which is mentioned in step 4.
When all data sources experience delays then Second Stage will be activated, in which algorithm performs joins between previously flushed data. Any matches found are output as result tuples. So that algorithm will progress while no input is being delivered. If any of the join inputs have resumed producing tuples, then the algorithm switches back to the First Stage. After all tuples have been received from all relations then Third Stage executes, that makes sure that all the tuples should be in the result set are ultimately produced.
Table 1 illustrates the percentage of the total result obtained by algorithm during the first stage, in the presence of multiple inputs while we vary the available memory. Result of Table 1 is summarized in Fig. 3 which presents a comparison between MINER for SSTAR joins and MINER for Snowflake joins. There is a modest increase in the number of results obtained as we increase the available memory.
A novel adaptive join algorithm MINER for Snowflake Joins has been proposed, that produces result tuples at a significantly higher rate, while making better use of the available memory. It is also designed for dealing with cases where data are held by multiple remote sources and relations that may participate in joins with more than two join attributes.
In future work it can explore to the optimization problem of how to best to split a very large multi-way join into a set of smaller multi-way joins.
Make the best use of Scientific Research and information from our 700 + peer reviewed, ºÚÁÏÍø Journals