CS 5614 Homework #6

Solution Sketches
  1. It is clear that we need to do a join of 7 copies of the Acts relation. Assume that the columns of the relation Acts are (ActorName,MovieName). Here's "a" solution:

    SELECT T1.MovieName, T2.ActorName, T3.MovieName, T4.ActorName, T5.MovieName
    FROM Acts AS T1, Acts AS T2, Acts AS T3, Acts AS T4, Acts AS T5, Acts AS T6
    WHERE T1.ActorName="WoodyHarrelson"
    AND T1.MovieName=T2.MovieName
    AND T2.ActorName=T3.ActorName
    AND T3.MovieName=T4.MovieName
    AND T4.ActorName=T5.ActorName
    AND T5.MovieName=T6.MovieName
    AND T6.ActorName="EddieMurphy"

    The answer you will get for this query is:
    (Natural Born Killers, Juliette Lewis, Cape Fear, Nick Nolte, 48 Hours).

  2. Assume that R is R(A,B) and S is S(B,C), where A, B and C are arbitrary sets of attributes. Three expressions for the semi-join are:

    (a) Pi_{A,B} (R Join S)
    (b) R Join Pi_B (S)
    (c) R Intersect (Pi_A (R) X Pi_B (S))

    Let us now attempt to estimate the size of the semi-join. There are clearly a multitude of ways to do this, depending on the actual formula used and on the assumptions you make. Here's one analysis. The first thing to notice is that unlike the join, the semijoin is assymetric. Thus, we need to be careful on how we do the estimates. Consider the case when V(R,B) <= V(S,B). We assume that the containment and preservation constraints on value sets is satisfied. Thus, each tuple in R will match some tuple in S on the "B" attributes. The semijoin would thus be R (itself).

    If, on the other hand, V(R,B) >= V(S,B), then not all tuples of R would be included in the final answer (only those that match with S on "B"). Since there are V(S,B) unique values of B in S, only those tuples of R that have these values will take part in the semi-join. However we have no idea how many tuples of R have these desired values. A quick fix is to use the estimate V(R,A)*V(S,B).

    The logical step now is to cast these two answers as special cases of one bigger formula. That sounds nasty until you notice that the first case can be modeled as a duplicate elimination situation. Our solution is to use: V(R,A)*min(V(R,B),V(S,B)) (why?). If this turns out to be greater than the actual size of R, then use R instead (this problem is endemic to most of our estimation formulas); but they work really well in the average case!

    Before you panic (in case your solution doesn't look like the above!), notice that these are estimates; you will still get full points if you were able to justify your answers. There is no one correct answer! However, you will lose some points if all you did was to reason from the above formulas verbatim and conclude that the size of the semijoin is the same as the size of the join! :-(

  3. The following test cases were used (the command line arguments in each case were identified):

    1. my_algorithm 300 400

      Answers for Greedy and DP are the same: Either of (R1 Join R2), (R2 Join R1)

    2. my_algorithm 200 100 500 20 50 1000

      Answers for Greedy and DP are the same: Either of (((R3 Join R4) Join R2) Join R1), (((R4 Join R3) Join R2) Join R1)

    3. my_algorithm 526 500 100 100 500 500 400 400

      Answers for DP are: ((((R3 Join R4) Join R5) Join R2) Join R1), or ((((R4 Join R3) Join R5) Join R2) Join R1)
      Answers for Greedy are: ((((R1 Join R2) Join R3) Join R4) Join R5), or ((((R2 Join R1) Join R3) Join R4) Join R5)

    4. my_algorithm 500 500 100 100 500 500 400 400

      Answers for DP are: ((((R3 Join R4) Join R5) Join R2) Join R1), ((((R4 Join R3) Join R5) Join R2) Join R1)
      Answers for Greedy are: ((((R3 Join R4) Join R5) Join R2) Join R1), ((((R4 Join R3) Join R5) Join R2) Join R1), ((((R1 Join R2) Join R3) Join R4) Join R5), ((((R2 Join R1) Join R3) Join R4) Join R5)

    Part (3) covers 50% of the weightage for this question.

Return Home