CS 5614 Homework #6
Date Assigned: October 29, 1999
Date Due: November 8, 1999, in class, before class starts
- (20 points) This is really a silly question for a graduate
course, but we include it for fun times sake!
An indulgent (and appears-to-be-worthless) pasttime in the movie industry is to make
"connections" among actors and actresses via the movies
they have acted in (together).
For example, to make a connection from Tom Hanks to Kevin Bacon,
it just takes one step (Both of them have acted in "Apollo 13",
for instance). But to make a link from Meg Ryan to Kevin Bacon,
it takes two steps (they have to be "routed" through Tom Hanks,
because they haven't acted in a movie together). So from Ryan,
you get to Sleepless-in-Seattle, from there you go to Hanks,
then you take Apollo 13 to Bacon. In fact, some UVA
students have mined the Internet Movie database to create
an "Oracle of Bacon"!
Similar procedures exist in the academic world (Erdos number,
where people are linked by the common publications they author) etc.
Academicians and researchers then proudly flaunt their erdos number!
Popular fables in India claim that with "six intermediaries", one
can reach anybody else in the world! (So Kings' "power" used to
be measured with this number; myth has it that nobody reached
6, because the gods ensured that no king became so powerful :-))
A full blown article on this recently appeared in the
Dec 1998 issue
of DISCOVER magazine.
Anyways, back to CS 5614. Not to be outdone, we intend to use database technology to identify
connections among people. Assume that you are given the following tuples
in a two-column relation called Acts:
(Uma Thurman, Mad Dog and Glory)
(Robert De Niro, Cape Fear)
(Woody Harrelson, Natural Born Killers)
(Dan Aykroyd, The Actor)
(Judge Reinhold, Beverly Hills Cop)
(Jamie Lee Curtis, True Lies)
(Woody Harrelson, Palmetto)
(Nick Nolte, 48 Hours)
(James Russo, Beverly Hills Cop)
(Eddie Murphy, 48 Hours)
(Jamie Lee Curtis, Perfect)
(Tommy Lee Jones, Natural Born Killers)
(Dan Aykroyd, North)
(Tara King, 48 Hours)
(Robert De Niro, Sleepers)
(Nick Nolte, Cape Fear)
(Juliette Lewis, Natural Born Killers)
(Jessica Lange, Cape Fear)
(Eddie Murphy, Beverly Hills Cop)
(Robert De Niro, Mad Dog and Glory)
(Robert Downey Jr., Natural Born Killers)
(Eddie Murphy, Trading Places)
(Juliette Lewis, Cape Fear)
Your task is to make a connection between Woody Harrelson and Eddie Murphy,
using *ONLY* the data shown above. Of course, you realize that we are
not really worried about how these two are related (in the totality
of real-life data), but to see if you can
effect the connection via an SQL query. So, use the diagram below;
your goal is to fill out the missing two star names and three movie
names. To do that, you need to write an SQL query and run it only
on the data shown above (again, we are not interested in "complete and real" data,
only the results on the data shown above).
For full credit, submit your SQL query and the result that you get. To
do this exercise, you need to actually sit in front of the computer,
load the data, write the query and report on the answer (HINT:
There is only one solution).
FAQ: I found a "shorter path" between these two actors!!!
Is that okay?
Answer: No, that is not okay. Pl. realize that our
goal is not to mine the relationships among actors but to
learn some database stuff. Your answer should contain exactly
three movies and two actors.
FAQ: I found the answer, just by looking at it. Is that
okay?
Answer: No, again. We need the SQL query.
- (20+20=40 points) The semi-join of relations R and S, is the set of
tuples of R that agree with at least one tuple of S on all attributes
that are common to the schemas of R and S. Give an expresion in relational
algebra that is equivalent to "R semi-join S". Also explain how you will
estimate the size of a semi-join, using our ideas for estimating the
size of natural joins?
- (40 points) Assume that you are given "n" relations - R1, R2, ..., Rn.
Each of the "R"s has only two attributes. The second attribute of a relation
is the same as the first attribute of the "next" relation. e.g. R1 could
be R1(X,Y), then R2 would be R2(Y,Z), R3 is R3(Z,L) and so on. Thus, the
total number of "distinct" attributes in all the relations is (n+1).
Write a program in your favorite programming language
(C/C++/Perl etc.) that determines an optimal join strategy for computing
"R1 Join R2 .... Rn". Assume that all relations are of size 1000.
The program will take as input the "V" information, for only the "common"
attributes (you don't really need any other data). The number of "V"s
needed is 2(n-2) + 2. Your program will take them as command line arguments.
For example, one invocation of your program would be:
> my_algorithm 200 100 500 20
This attempts to do a join of three relations: R(X,Y), S(Y,Z), T(Z,L),
where #(R) = #(S) = #(Y) = 1000, and V(R,Y) = 200, V(S,Y) = 100, V(S,Z) = 500
and V(T,Z) = 20. You can restrict your answer to only left-deep joins.
Your program should output two "groupings":
- The grouping obtained by a dynamic programming search
- The grouping obtained by a greedy search
In each case, output your answer with suitable parentheses and/or indentation.
For example, for the above "run", the answer should be:
Dynamic Programming Ordering: ((R2 Join R3) Join R1)
Greedy Ordering: ((R2 Join R3) Join R1)
(In this case, they are the same, but you cannot generally assume that
they are identical). Recall that the dynamic programming approach searches
through all possible O(n!) combinations for left-deep trees and picks out
the ones that are "cheapest". What does "cheapest" mean? Recall again that
cost of a join is the sum of the sizes of all the intermediate relations
considered. The greedy algorithm, on the other hand, picks the two relations
whose estimated join is smallest, then incrementally adds relations (such that
the join size is minimized) at each step. It thus makes local decisions at
every time. As another example, the answer for:
> my_algorithm 200 100 500 20 50 1000
is:
Dynamic Programming Ordering: (((R3 Join R4) Join R2) Join R1)
Greedy Ordering: (((R3 Join R4) Join R2) Join R1)
Again, notice that the greedy answer is the same as the DP one, but this
is not generally the case.
What you should turnin: A location on the csgrad lab machines, in
your home directory, where we can go and test out this ".exe" file. You
are welcome to improvise as much as you want, so long as you satisfy the
stated requirements. We will not provide the "data" (nor tell you
how many "n"s there are going to be) on which we are going
to test your program beforehand, but feel free to exchange test data
(not code!) among yourselves.
FAQ: This question is downright sadistic! What is the point
in asking us to implement this painful program?
Answer: Our goal here is to give you an appreciation for how
a query compiler chooses a logical plan. The actual code doesn't take more
than 2 pages of any high level lang. code. It is our belief that the
only way to "gain this appreciation"is to get your hands dirty and code
this up. (You may have some surprises when you go from three relations
to four relations)