CS 5614 Homework #6

Date Assigned: October 29, 1999
Date Due: November 8, 1999, in class, before class starts

(20 points) This is really a silly question for a graduate course, but we include it for fun times sake! An indulgent (and appears-to-be-worthless) pasttime in the movie industry is to make "connections" among actors and actresses via the movies they have acted in (together). For example, to make a connection from Tom Hanks to Kevin Bacon, it just takes one step (Both of them have acted in "Apollo 13", for instance). But to make a link from Meg Ryan to Kevin Bacon, it takes two steps (they have to be "routed" through Tom Hanks, because they haven't acted in a movie together). So from Ryan, you get to Sleepless-in-Seattle, from there you go to Hanks, then you take Apollo 13 to Bacon. In fact, some UVA students have mined the Internet Movie database to create an "Oracle of Bacon"! Similar procedures exist in the academic world (Erdos number, where people are linked by the common publications they author) etc. Academicians and researchers then proudly flaunt their erdos number! Popular fables in India claim that with "six intermediaries", one can reach anybody else in the world! (So Kings' "power" used to be measured with this number; myth has it that nobody reached 6, because the gods ensured that no king became so powerful :-)) A full blown article on this recently appeared in the Dec 1998 issue of DISCOVER magazine.
Anyways, back to CS 5614. Not to be outdone, we intend to use database technology to identify connections among people. Assume that you are given the following tuples in a two-column relation called Acts:

(Uma Thurman, Mad Dog and Glory)
(Robert De Niro, Cape Fear)
(Woody Harrelson, Natural Born Killers)
(Dan Aykroyd, The Actor)
(Judge Reinhold, Beverly Hills Cop)
(Jamie Lee Curtis, True Lies)
(Woody Harrelson, Palmetto)
(Nick Nolte, 48 Hours)
(James Russo, Beverly Hills Cop)
(Eddie Murphy, 48 Hours)
(Jamie Lee Curtis, Perfect)
(Tommy Lee Jones, Natural Born Killers)
(Dan Aykroyd, North)
(Tara King, 48 Hours)
(Robert De Niro, Sleepers)
(Nick Nolte, Cape Fear)
(Juliette Lewis, Natural Born Killers)
(Jessica Lange, Cape Fear)
(Eddie Murphy, Beverly Hills Cop)
(Robert De Niro, Mad Dog and Glory)
(Robert Downey Jr., Natural Born Killers)
(Eddie Murphy, Trading Places)
(Juliette Lewis, Cape Fear)

Your task is to make a connection between Woody Harrelson and Eddie Murphy, using *ONLY* the data shown above. Of course, you realize that we are not really worried about how these two are related (in the totality of real-life data), but to see if you can effect the connection via an SQL query. So, use the diagram below; your goal is to fill out the missing two star names and three movie names. To do that, you need to write an SQL query and run it only on the data shown above (again, we are not interested in "complete and real" data, only the results on the data shown above).
For full credit, submit your SQL query and the result that you get. To do this exercise, you need to actually sit in front of the computer, load the data, write the query and report on the answer (HINT: There is only one solution).
FAQ: I found a "shorter path" between these two actors!!! Is that okay?

Answer: No, that is not okay. Pl. realize that our goal is not to mine the relationships among actors but to learn some database stuff. Your answer should contain exactly three movies and two actors.
FAQ: I found the answer, just by looking at it. Is that okay?

Answer: No, again. We need the SQL query.
(20+20=40 points) The semi-join of relations R and S, is the set of tuples of R that agree with at least one tuple of S on all attributes that are common to the schemas of R and S. Give an expresion in relational algebra that is equivalent to "R semi-join S". Also explain how you will estimate the size of a semi-join, using our ideas for estimating the size of natural joins?
(40 points) Assume that you are given "n" relations - R1, R2, ..., Rn. Each of the "R"s has only two attributes. The second attribute of a relation is the same as the first attribute of the "next" relation. e.g. R1 could be R1(X,Y), then R2 would be R2(Y,Z), R3 is R3(Z,L) and so on. Thus, the total number of "distinct" attributes in all the relations is (n+1).

Write a program in your favorite programming language (C/C++/Perl etc.) that determines an optimal join strategy for computing "R1 Join R2 .... Rn". Assume that all relations are of size 1000. The program will take as input the "V" information, for only the "common" attributes (you don't really need any other data). The number of "V"s needed is 2(n-2) + 2. Your program will take them as command line arguments. For example, one invocation of your program would be:

> my_algorithm 200 100 500 20

This attempts to do a join of three relations: R(X,Y), S(Y,Z), T(Z,L), where #(R) = #(S) = #(Y) = 1000, and V(R,Y) = 200, V(S,Y) = 100, V(S,Z) = 500 and V(T,Z) = 20. You can restrict your answer to only left-deep joins. Your program should output two "groupings":
- The grouping obtained by a dynamic programming search
- The grouping obtained by a greedy search
In each case, output your answer with suitable parentheses and/or indentation. For example, for the above "run", the answer should be:

Dynamic Programming Ordering: ((R2 Join R3) Join R1)
Greedy Ordering: ((R2 Join R3) Join R1)

(In this case, they are the same, but you cannot generally assume that they are identical). Recall that the dynamic programming approach searches through all possible O(n!) combinations for left-deep trees and picks out the ones that are "cheapest". What does "cheapest" mean? Recall again that cost of a join is the sum of the sizes of all the intermediate relations considered. The greedy algorithm, on the other hand, picks the two relations whose estimated join is smallest, then incrementally adds relations (such that the join size is minimized) at each step. It thus makes local decisions at every time. As another example, the answer for:

> my_algorithm 200 100 500 20 50 1000

is:

Dynamic Programming Ordering: (((R3 Join R4) Join R2) Join R1)
Greedy Ordering: (((R3 Join R4) Join R2) Join R1)

Again, notice that the greedy answer is the same as the DP one, but this is not generally the case.

What you should turnin: A location on the csgrad lab machines, in your home directory, where we can go and test out this ".exe" file. You are welcome to improvise as much as you want, so long as you satisfy the stated requirements. We will not provide the "data" (nor tell you how many "n"s there are going to be) on which we are going to test your program beforehand, but feel free to exchange test data (not code!) among yourselves.
FAQ: This question is downright sadistic! What is the point in asking us to implement this painful program?

Answer: Our goal here is to give you an appreciation for how a query compiler chooses a logical plan. The actual code doesn't take more than 2 pages of any high level lang. code. It is our belief that the only way to "gain this appreciation"is to get your hands dirty and code this up. (You may have some surprises when you go from three relations to four relations)