- (a) First notice that the "splitting" is not being performed
according to the normal rules of BCNF. If the FD A->B is
indeed in violation, we should get the relational schemas as
{A,B}, {A,C} and not {A,B}, {B,C}. The only way in which
this decomposition could be lossless would be if it was performed
on the lines of some other violating FD.
Since they share the B attribute, the only choices
we have are: B->A or B->C.
Let's pick B->C as the
violating FD.
We thus have two FDs for the relation: {A->B, B->C}.
The key would be {A}. The second FD is indeed in violation
of BCNF (as planned) since its left hand side is not a superkey.
An example instance would thus be:
a1 b1 c1
a2 b2 c2
a3 b2 c2
a4 b1 c1
If you started with the other FD, you will get a solution of a
different type.
(b) An instance of R for which the given decomposition is lossy would
be:
a1 b1 c1
a2 b1 c2
Decomposing it, we get:
a1 b1
a2 b1
and
b1 c1
b1 c2
Joining them back gives, instead:
a1 b1 c1
a1 b1 c2
a2 b1 c1
a2 b1 c2
i.e., we get two extra tuples. The decomposition is thus lossy.
How did we arrive at this example? Notice how we
did the decomposition. We performed it based
on the assumption that either
B->C or B->A holds. I have designed the example (cleverly) to
ensure that both of these FDs are violated. There is no way we will
get the original tuples back by performing a decomposition based
on an FD(s) that do not even hold in the first place!
- We need to first determine the key(s). Notice that A, B and E do
not appear on the right side of any FD. Therefore, they have got to be
part of a key. Let us see if {A,B,E} by itself could be the key. The closure
of {A,B,E} = {A,B,C,D,E}. Thus, indeed the only key is {A,B,E}.
- We first need to systematically determine all the FDs that hold in R.
This is done by computing the closures of all subsets of R. If you
do this meticulously, you will notice that you will get an additional FD
BE->C.
The FDs that are in violation of BCNF are all of them(!), since they
do not have a superkey on the left. In other words, the violating FDs are:
AB->C
DE->C
B->D
BE->C
- Let us try to decompose it by the first FD: AB->C. We get {A,B,C},{A,B,D,E}.
In {A,B,C}, the only FD that holds in AB->C. Since AB is the key, there is
no violation of BCNF. In {A,B,D,E}, the FD B->D is in violation of BCNF.
We decompose it further to obtain {B,D},{B,A,E}. Each of these
is in BCNF (why?). The final relations are
therefore {{A,B,C},{B,D},{A,B,E}}.
- All the original FDs are in violation of 3NF too, since in addition
to none of the left hand sides being a superkey, none of the right hand
sides has part of a key.
- Assume that we decompose by B->D. We get {{B,D},{B,A,C,E}}. In
the second schema AB->C, BE->C hold. The key is still {A,B,E}.
Both are in violation of 3NF. Lets decompose via AB->C. We get
{{A,B,C},{A,B,E}}. Each of these is in 3NF. The final schemas
are {{B,D},{A,B,C},{A,B,E}}.
Rheena Khosla pointed out that this decomposition does not preserve
dependencies (notice that BE->C and DE->C cannot be preserved
without complicated joins), though we learnt in class that 3NF is good
at preserving FDs. The actual algorithm for performing a dependency
preserving decomposition is beyond the scope of this course. If interested,
please consult the Cow book or the Boat book. The itty-bitty-dirty fix
to this problem is to add another schema {B,C,D,E} that helps us preserve
both the FDs BE->C and DE->C. (In other words,
before tuples are added to the other schemas,
this schema will ensure that the FDs are preserved).
- The only way we will know for sure if we got all the FDs
is to compute closures of all subsets of S (we take
the closure with respect to the original relation R, to make
sure we don't miss anything; of course we can always throw
away the attributes that are not in S).
Closure of A = {A,D}
Closure of B = {B}
Closure of C = {C}
Closure of AB = {A,B,D,E}
Closure of AC = {A,C,D,E,B} This gives AC->B
Closure of BC = {B,C}
Thus, the only FD we can state for S is AC->B.
- Every 4NF relation has to be in BCNF first. Let us therefore first
render the relation in BCNF. The key is {A,B,C} and the FD B->D is
in violation. We therefore get {{B,D},{B,A,C}}. There are
no more FDs and these sub-relations are therefore in BCNF. Let us
now test if they are in 4NF. The MD AB->->C holds in the latter
relation but it is a trivial MD. Therefore, the final answer is
{B,D},{A,B,C}}
- We are given that X->->Y and X->->Z. We would like
to prove that X->->(Y-Z) holds. In other words, for a given X,
there are many choices of (Y-Z). Another way of saying the same
is that X and (Y-Z) are independent of each other. Now, notice
that (Y-Z) is a subset of Y and hence Y->(Y-Z) holds. From which,
we can derive that Y->->(Y-Z). Since X->->Y and Y->->(Y-Z), we
can apply the transitivity rule of MDs to obtain: X->->(Y-Z).
- R is given to be in 3NF. This means that for any FD X->Y in R,
either X will be the superkey or Y has part of a key. But since keys
consist of only one attribute, this effectively means that Y is a key.
Now, what can determine a key completely? It has to be another
superkey on the left (since the closure of it will lead to the set of
all attributes). Hence, for any FD X->Y in R, we can guarantee that X
is a superkey, which is the condition for BCNF. Q.E.D. (Note: Though
this looks like an informal argument, one can write a simple proof
based on it.) The converse is trivially true, since every relation
in BCNF is also in 3NF (nothing to do with whether the key is
simple or not!)
- If the MD A->->B holds, then the MD A->->C also holds.
In general, this problem might be tricky to solve, but notice that all
the A values are the same.
We are given:
a b1 c1
a b2 c2
a b2 c3
a b3 c3
Since there are three B values, and three C values (for the same A),
we can confidently state that there will be 9 combinations.
The question only lists four.
We can thus state the following five tuples:
a b1 c2
a b1 c3
a b2 c1
a b3 c1
a b3 c2
- We are given:
a1 b1 c1
a1 b1 c2
a2 b1 c1
a2 b1 c3
We can try A->->C (from the fact that a1 "goes to" both c1 and c2),
which will also lead us to A->->B. a1 goes to two C's and one B. Thus, there
are two tuples with a1. a2 goes to one B and two C's. There are indeed
two tuples with a2. Thus, A->->C and A->->B hold.
However, B->->A and B->->C do not hold.
To see why, assume that they hold. Notice that b1 goes to two A's and three
C's. We would therefore expect to see six tuples with b1.
However,
the tuples (a1,b1,c3),(a2,b1,c2) are absent, therefore
these MDs possibly cannot hold. The only other
MDs that hold are C->->A (and therefore C->->B). Of course, there are the six other
trivial MDs (these will always hold in any three-attribute relation):
A->->BC
AB->->C
AC->->B
B->->AC
BC->->A
C->->AB
- The relational model has remained remarkably robust since the early
70's when it was first formulated by Codd. Here's our analysis. We know
we exceeded our own space limits, but
for
graduate students embarking on a research program, we hope that
this serves as an illustration of "literature surveys" that will be
expected of you in your careers. :-)
- Data Modeling:
- Codd himself improved on his original relational model in his
Turing award address, and future papers and books. What we have now is
a primary descendant of his models.
- Substantial improvements in data modeling have occured, particularly
with the advent of the E/R model proposed by P.P. Chen at MIT ("The Entity-Relationship Model: Towards a Unified View of Data", ACM Transactions on Database
Systems, Vol. 1, No. 1, pages 9-36, January 1976). These
models permit the logical modeling of data in a fashion that closely
mimics the relational way, at the same time
allowing a higher level of abstraction. Most current textbooks cover
variants of this model that include abstraction, inheritance, aggregation
and classification. Such models also permit the delineation of modeling
from implementation, which was one of the goals of Codd's paper.
- Most commercial systems violate the original maxim of Codd's relational
model by allowing duplicate tuples in relations.
- The relational model is also currently criticized for its inability
to meet the needs of users from domains other than what it was intended for.
Object oriented models have been proposed with the net-result that
relational and OO systems meet mid-way to provide various species of
object-relational systems. The primary purpose of such systems is to provide
richer semantics and support for enriched data types.
- With the advent of the World Wide Web, the idea that database techniques
can be used in information retrieval has gained momentum. This has led
to the development of semi-structured models that employ graph-based
representations to abstract schema, and allow imprecise queries (using
formats such as XML etc.). A good example of a semi-structured database system
is the Lore project at Stanford.
Facilities for non-traditional forms of data have also been explored, like
images, sounds, etc.
- Multidimensional database systems were once proposed as
an alternative to RDBMSs. Though this is yet to catch on in the original way
intended, they now form the basis of data warehousing systems. Such systems store historical and archival data as opposed to storing
transactional data (that runs the business). They also store
an order-of-magnitude more data than traditional database systems.
- Deductive database systems employing the logical model have been proposed
as an extension to RDBMSs. These provide deduction facilities similar to
those provided by PROLOG, but also promise efficiency of query answering.
- Constraints and Dependencies:
- Codd's original paper identifies various goals of a RDBMS in terms of
the flavors of "independence" it should support. Most of the constraints
and dependencies modeled by relational
design techniques can be seen as centered around this idea. Database systems
routinely support the declaration of key constraints and referential
integrity constraints. They also hide implementation details from
the user of the database system.
- In addition, active database elements such as domain-specific constraints
and triggers have become necessary to encode application semantics into
the database system. Though these took longer in coming, they are now
accepted as part of mainstream database facilities.
- Fagin introduced a new notion of dependencies called MDs in
("Multivalued dependencies and a new normal form for relational databases",
ACM Transactions on Database Systems, Vol. 2, No. 3, pages 262-278, 1977).
More notions of dependencies, such as Join Dependencies, followed.
- Bernstein is credited with discovering the algorithm for testing an FD
by computing the closure of a set of attributes. The reference for this
is ("Synthesizing Third Normal Form Relations from FDs", ACM
Transactions on Database Systems, Vol. 1, No. 4, pages 277-298, 1976.).
- A complete axiomatization of a theory of FDs was first propounded by
Armstrong (after which they are named) and subsequently pursued by
Beeri, Fagin et.al. (C. Beeri, R. Fagin, J.H. Howard, "A complete
axiomatization for FDs and MDs", ACM SIGMOD, pages 47-61, 1977).
- Algorithmic research into reasoning with dependencies has since
exploded after Codd's original paper. These have given rise to improved
and sophisticated techniques for commercial implementations.
- Normal Forms:
- Codd's original paper contained the definition of 3NF and he later
followed it up in a paper with what is now known as BCNF.
- Fagin also introduced 4NF with his new notion of MDs. This thread
of research gave rise to further specializations of normal forms.
- Desirable properties for decompositions have been carefully
studied and normal forms have been characterized based on the properties
they absolutely guarantee over others.