CS 5614 Homework #4

Solution Sketches

(a) First notice that the "splitting" is not being performed according to the normal rules of BCNF. If the FD A->B is indeed in violation, we should get the relational schemas as {A,B}, {A,C} and not {A,B}, {B,C}. The only way in which this decomposition could be lossless would be if it was performed on the lines of some other violating FD. Since they share the B attribute, the only choices we have are: B->A or B->C. Let's pick B->C as the violating FD. We thus have two FDs for the relation: {A->B, B->C}. The key would be {A}. The second FD is indeed in violation of BCNF (as planned) since its left hand side is not a superkey. An example instance would thus be:
```
a1 b1 c1
a2 b2 c2
a3 b2 c2
a4 b1 c1
```
If you started with the other FD, you will get a solution of a different type.
(b) An instance of R for which the given decomposition is lossy would be:
```
a1 b1 c1
a2 b1 c2
```
Decomposing it, we get:
```
a1 b1
a2 b1
```
and
```
b1 c1
b1 c2
```
Joining them back gives, instead:
```
a1 b1 c1
a1 b1 c2
a2 b1 c1
a2 b1 c2
```
i.e., we get two extra tuples. The decomposition is thus lossy. How did we arrive at this example? Notice how we did the decomposition. We performed it based on the assumption that either B->C or B->A holds. I have designed the example (cleverly) to ensure that both of these FDs are violated. There is no way we will get the original tuples back by performing a decomposition based on an FD(s) that do not even hold in the first place!
We need to first determine the key(s). Notice that A, B and E do not appear on the right side of any FD. Therefore, they have got to be part of a key. Let us see if {A,B,E} by itself could be the key. The closure of {A,B,E} = {A,B,C,D,E}. Thus, indeed the only key is {A,B,E}.
1. We first need to systematically determine all the FDs that hold in R. This is done by computing the closures of all subsets of R. If you do this meticulously, you will notice that you will get an additional FD BE->C. The FDs that are in violation of BCNF are all of them(!), since they do not have a superkey on the left. In other words, the violating FDs are:
```
AB->C
DE->C
B->D
BE->C
```
2. Let us try to decompose it by the first FD: AB->C. We get {A,B,C},{A,B,D,E}. In {A,B,C}, the only FD that holds in AB->C. Since AB is the key, there is no violation of BCNF. In {A,B,D,E}, the FD B->D is in violation of BCNF. We decompose it further to obtain {B,D},{B,A,E}. Each of these is in BCNF (why?). The final relations are therefore {{A,B,C},{B,D},{A,B,E}}.
3. All the original FDs are in violation of 3NF too, since in addition to none of the left hand sides being a superkey, none of the right hand sides has part of a key.
4. Assume that we decompose by B->D. We get {{B,D},{B,A,C,E}}. In the second schema AB->C, BE->C hold. The key is still {A,B,E}. Both are in violation of 3NF. Lets decompose via AB->C. We get {{A,B,C},{A,B,E}}. Each of these is in 3NF. The final schemas are {{B,D},{A,B,C},{A,B,E}}.
  
  Rheena Khosla pointed out that this decomposition does not preserve dependencies (notice that BE->C and DE->C cannot be preserved without complicated joins), though we learnt in class that 3NF is good at preserving FDs. The actual algorithm for performing a dependency preserving decomposition is beyond the scope of this course. If interested, please consult the Cow book or the Boat book. The itty-bitty-dirty fix to this problem is to add another schema {B,C,D,E} that helps us preserve both the FDs BE->C and DE->C. (In other words, before tuples are added to the other schemas, this schema will ensure that the FDs are preserved).
The only way we will know for sure if we got all the FDs is to compute closures of all subsets of S (we take the closure with respect to the original relation R, to make sure we don't miss anything; of course we can always throw away the attributes that are not in S).
```
Closure of A = {A,D} 
Closure of B = {B} 
Closure of C = {C}
Closure of AB = {A,B,D,E}
Closure of AC = {A,C,D,E,B} This gives AC->B
Closure of BC = {B,C}
```
Thus, the only FD we can state for S is AC->B.
Every 4NF relation has to be in BCNF first. Let us therefore first render the relation in BCNF. The key is {A,B,C} and the FD B->D is in violation. We therefore get {{B,D},{B,A,C}}. There are no more FDs and these sub-relations are therefore in BCNF. Let us now test if they are in 4NF. The MD AB->->C holds in the latter relation but it is a trivial MD. Therefore, the final answer is {B,D},{A,B,C}}
We are given that X->->Y and X->->Z. We would like to prove that X->->(Y-Z) holds. In other words, for a given X, there are many choices of (Y-Z). Another way of saying the same is that X and (Y-Z) are independent of each other. Now, notice that (Y-Z) is a subset of Y and hence Y->(Y-Z) holds. From which, we can derive that Y->->(Y-Z). Since X->->Y and Y->->(Y-Z), we can apply the transitivity rule of MDs to obtain: X->->(Y-Z).
R is given to be in 3NF. This means that for any FD X->Y in R, either X will be the superkey or Y has part of a key. But since keys consist of only one attribute, this effectively means that Y is a key. Now, what can determine a key completely? It has to be another superkey on the left (since the closure of it will lead to the set of all attributes). Hence, for any FD X->Y in R, we can guarantee that X is a superkey, which is the condition for BCNF. Q.E.D. (Note: Though this looks like an informal argument, one can write a simple proof based on it.) The converse is trivially true, since every relation in BCNF is also in 3NF (nothing to do with whether the key is simple or not!)
If the MD A->->B holds, then the MD A->->C also holds. In general, this problem might be tricky to solve, but notice that all the A values are the same. We are given:
```
a b1 c1
a b2 c2
a b2 c3
a b3 c3
```
Since there are three B values, and three C values (for the same A), we can confidently state that there will be 9 combinations. The question only lists four. We can thus state the following five tuples:
```
a b1 c2
a b1 c3
a b2 c1
a b3 c1
a b3 c2
```
We are given:
```
a1 b1 c1
a1 b1 c2
a2 b1 c1
a2 b1 c3
```
We can try A->->C (from the fact that a1 "goes to" both c1 and c2), which will also lead us to A->->B. a1 goes to two C's and one B. Thus, there are two tuples with a1. a2 goes to one B and two C's. There are indeed two tuples with a2. Thus, A->->C and A->->B hold. However, B->->A and B->->C do not hold. To see why, assume that they hold. Notice that b1 goes to two A's and three C's. We would therefore expect to see six tuples with b1. However, the tuples (a1,b1,c3),(a2,b1,c2) are absent, therefore these MDs possibly cannot hold. The only other MDs that hold are C->->A (and therefore C->->B). Of course, there are the six other trivial MDs (these will always hold in any three-attribute relation):
```
A->->BC
AB->->C
AC->->B
B->->AC
BC->->A
C->->AB
```
The relational model has remained remarkably robust since the early 70's when it was first formulated by Codd. Here's our analysis. We know we exceeded our own space limits, but for graduate students embarking on a research program, we hope that this serves as an illustration of "literature surveys" that will be expected of you in your careers. :-)
- Data Modeling:
  1. Codd himself improved on his original relational model in his Turing award address, and future papers and books. What we have now is a primary descendant of his models.
  2. Substantial improvements in data modeling have occured, particularly with the advent of the E/R model proposed by P.P. Chen at MIT ("The Entity-Relationship Model: Towards a Unified View of Data", ACM Transactions on Database Systems, Vol. 1, No. 1, pages 9-36, January 1976). These models permit the logical modeling of data in a fashion that closely mimics the relational way, at the same time allowing a higher level of abstraction. Most current textbooks cover variants of this model that include abstraction, inheritance, aggregation and classification. Such models also permit the delineation of modeling from implementation, which was one of the goals of Codd's paper.
  3. Most commercial systems violate the original maxim of Codd's relational model by allowing duplicate tuples in relations.
  4. The relational model is also currently criticized for its inability to meet the needs of users from domains other than what it was intended for. Object oriented models have been proposed with the net-result that relational and OO systems meet mid-way to provide various species of object-relational systems. The primary purpose of such systems is to provide richer semantics and support for enriched data types.
  5. With the advent of the World Wide Web, the idea that database techniques can be used in information retrieval has gained momentum. This has led to the development of semi-structured models that employ graph-based representations to abstract schema, and allow imprecise queries (using formats such as XML etc.). A good example of a semi-structured database system is the Lore project at Stanford. Facilities for non-traditional forms of data have also been explored, like images, sounds, etc.
  6. Multidimensional database systems were once proposed as an alternative to RDBMSs. Though this is yet to catch on in the original way intended, they now form the basis of data warehousing systems. Such systems store historical and archival data as opposed to storing transactional data (that runs the business). They also store an order-of-magnitude more data than traditional database systems.
  7. Deductive database systems employing the logical model have been proposed as an extension to RDBMSs. These provide deduction facilities similar to those provided by PROLOG, but also promise efficiency of query answering.
- Constraints and Dependencies:
  1. Codd's original paper identifies various goals of a RDBMS in terms of the flavors of "independence" it should support. Most of the constraints and dependencies modeled by relational design techniques can be seen as centered around this idea. Database systems routinely support the declaration of key constraints and referential integrity constraints. They also hide implementation details from the user of the database system.
  2. In addition, active database elements such as domain-specific constraints and triggers have become necessary to encode application semantics into the database system. Though these took longer in coming, they are now accepted as part of mainstream database facilities.
  3. Fagin introduced a new notion of dependencies called MDs in ("Multivalued dependencies and a new normal form for relational databases", ACM Transactions on Database Systems, Vol. 2, No. 3, pages 262-278, 1977). More notions of dependencies, such as Join Dependencies, followed.
  4. Bernstein is credited with discovering the algorithm for testing an FD by computing the closure of a set of attributes. The reference for this is ("Synthesizing Third Normal Form Relations from FDs", ACM Transactions on Database Systems, Vol. 1, No. 4, pages 277-298, 1976.).
  5. A complete axiomatization of a theory of FDs was first propounded by Armstrong (after which they are named) and subsequently pursued by Beeri, Fagin et.al. (C. Beeri, R. Fagin, J.H. Howard, "A complete axiomatization for FDs and MDs", ACM SIGMOD, pages 47-61, 1977).
  6. Algorithmic research into reasoning with dependencies has since exploded after Codd's original paper. These have given rise to improved and sophisticated techniques for commercial implementations.
- Normal Forms:
  1. Codd's original paper contained the definition of 3NF and he later followed it up in a paper with what is now known as BCNF.
  2. Fagin also introduced 4NF with his new notion of MDs. This thread of research gave rise to further specializations of normal forms.
  3. Desirable properties for decompositions have been carefully studied and normal forms have been characterized based on the properties they absolutely guarantee over others.