CS 5614 Homework #6
Solution Sketches
- It is clear that we need to do a join of 7 copies of the Acts
relation. Assume that the columns of the relation Acts are
(ActorName,MovieName). Here's "a" solution:
SELECT T1.MovieName, T2.ActorName, T3.MovieName, T4.ActorName, T5.MovieName
FROM Acts AS T1, Acts AS T2, Acts AS T3, Acts AS T4, Acts AS T5, Acts AS T6
WHERE T1.ActorName="WoodyHarrelson"
AND T1.MovieName=T2.MovieName
AND T2.ActorName=T3.ActorName
AND T3.MovieName=T4.MovieName
AND T4.ActorName=T5.ActorName
AND T5.MovieName=T6.MovieName
AND T6.ActorName="EddieMurphy"
The answer you will get for this query is:
(Natural Born Killers, Juliette Lewis, Cape Fear, Nick Nolte, 48 Hours).
- Assume that R is R(A,B) and S is S(B,C), where A, B and C are
arbitrary sets of attributes. Three expressions for the semi-join are:
(a) Pi_{A,B} (R Join S)
(b) R Join Pi_B (S)
(c) R Intersect (Pi_A (R) X Pi_B (S))
Let us now attempt to estimate the size of the semi-join.
There are clearly a multitude of ways to do this, depending on
the actual formula used and on the assumptions you make. Here's one
analysis. The first thing to notice is that unlike the join, the semijoin is assymetric. Thus, we need to be careful on how we do the estimates.
Consider the case when V(R,B) <= V(S,B). We assume that
the containment and preservation constraints on value sets is satisfied.
Thus, each tuple in R will match some tuple in S on the "B" attributes.
The semijoin would thus be R (itself). If, on the other hand,
V(R,B) >= V(S,B), then not all tuples of R would be included in
the final answer (only those that match with S on "B").
Since there are V(S,B) unique values of B in S, only those
tuples of R that have these values will take part in the semi-join.
However we have no idea how many tuples of R have these desired
values. A quick fix is to use the estimate V(R,A)*V(S,B).
The logical step now is to cast these two answers as special cases
of one bigger formula. That sounds nasty until you notice that
the first case can be modeled as a duplicate elimination situation.
Our solution is to use: V(R,A)*min(V(R,B),V(S,B)) (why?). If this
turns out to be greater than the actual size of R, then use R instead (this
problem is endemic to most of our estimation formulas); but they work really
well in the average case!
Before you panic (in case your solution doesn't look like the above!),
notice that these are estimates; you will still get full points
if you were able to justify your answers. There is
no one correct answer! However, you will lose some points if all you
did was to reason from the above formulas verbatim and conclude that
the size of the semijoin is the same as the size of the join! :-(
- The following test cases were used (the command line arguments in each
case were identified):
- my_algorithm 300 400
Answers for Greedy and DP are the same: Either of (R1 Join R2), (R2 Join R1)
- my_algorithm 200 100 500 20 50 1000
Answers for Greedy and DP are the same: Either of
(((R3 Join R4) Join R2) Join R1),
(((R4 Join R3) Join R2) Join R1)
- my_algorithm 526 500 100 100 500 500 400 400
Answers for DP are: ((((R3 Join R4) Join R5) Join R2) Join R1), or ((((R4 Join R3) Join R5) Join R2) Join R1)
Answers for Greedy are: ((((R1 Join R2) Join R3) Join R4) Join R5), or ((((R2 Join R1) Join R3) Join R4) Join R5)
- my_algorithm 500 500 100 100 500 500 400 400
Answers for DP are: ((((R3 Join R4) Join R5) Join R2) Join R1),
((((R4 Join R3) Join R5) Join R2) Join R1)
Answers for Greedy are: ((((R3 Join R4) Join R5) Join R2) Join R1), ((((R4 Join R3) Join R5) Join R2) Join R1),
((((R1 Join R2) Join R3) Join R4) Join R5), ((((R2 Join R1) Join R3) Join R4) Join R5)
Part (3) covers 50% of the weightage for this question.