CS 5614 Homework #1
Solution Sketches
- The reason the record-based network model did not allow
many-many relationships is that the original designers wanted
to avoid the use of variable-length records. A many-one
relationship, on the other hand, can be implemented in a
circular fashion using only one pointer per record. Hence, all
records (of a given type) have the same length, which in turn
allows a more efficient file organization.
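To make this concrete, here is a minimal sketch of the classic
one-pointer ring (class and field names are illustrative, not part
of the assignment): every member record carries exactly one "next"
pointer, and the ring closes back at its owner.

    # One-pointer ring for a many-one relationship: many Member
    # records per Owner. Since each record holds exactly ONE pointer,
    # all records of a given type stay fixed-length.
    class Owner:
        def __init__(self, key):
            self.key = key
            self.nxt = self          # an empty ring points to itself

    class Member:
        def __init__(self, key):
            self.key = key
            self.nxt = None          # the single pointer per record

    def insert(owner, member):
        # splice the new member into the ring, right after the owner
        member.nxt = owner.nxt
        owner.nxt = member

    def members_of(owner):
        # walk the ring until we come back around to the owner
        rec = owner.nxt
        while rec is not owner:
            yield rec
            rec = rec.nxt

    dept = Owner("CS")
    for name in ("a", "b", "c"):
        insert(dept, Member(name))
    print([m.key for m in members_of(dept)])   # ['c', 'b', 'a']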
- The key issue is to first notice why the question talks about
secondary indices vs. table scans, as opposed to primary indices
vs. table scans. The reason is that by saying "secondary index",
we are secretly implying a dense index! :-)
Part a: If each block contains 22 records, we will require about
90,910 blocks. The query returns 75,000 records, so the ratio of
answer records to blocks is 75,000/90,910 > 82%; a typical block
is therefore quite likely to contain an answer record. While the
secondary index must perform a random access for (almost) every
record in the answer, possibly revisiting the same block several
times, the table scan reads each block exactly once, sequentially.
Thus, the latter is better.
Part b: In contrast, if each block holds only 2 records, we will
require about a million blocks. The query will thus touch at most
7.5% of the total blocks. Hence, a secondary index will be very
fruitful.
Moral of the Story: Use a secondary index when the average query
that uses the index retrieves far fewer records than there are
blocks (thus, you need both a large number of records and high
discrimination to justify the implementation of a secondary
index).
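As a quick sanity check, here is a back-of-the-envelope sketch of
both parts. The total of 2,000,000 records is inferred from the
block counts above (90,910 * 22 and 1,000,000 * 2), so treat it as
an assumption rather than part of the problem statement.

    import math

    # Back-of-the-envelope check of parts (a) and (b). The total of
    # 2,000,000 records is inferred from the block counts quoted
    # above, i.e., it is an assumption, not given in the problem.
    N_RECORDS = 2_000_000
    ANSWERS = 75_000

    for recs_per_block in (22, 2):
        blocks = math.ceil(N_RECORDS / recs_per_block)
        # the index does one access per answer record, so it touches
        # at most this fraction of the blocks (a scan touches 100%):
        frac = min(ANSWERS, blocks) / blocks
        print(f"{recs_per_block} recs/block: {blocks} blocks, "
              f"index touches up to {frac:.1%} of them")
    # -> 22 recs/block: 90910 blocks, index touches up to 82.5% of them
    # ->  2 recs/block: 1000000 blocks, index touches up to 7.5% of them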
- This question seems to have been the most confusing. :-) If your
answer doesn't closely match what we have here, don't panic! All we
are looking for is whether you made reasonable assumptions and
whether the block counts in your answers have the same relative
ordering w.r.t. the ones here. So, for example, part (4) should
have the highest number of blocks and use 4 disk accesses, and so on.
- No. of blocks = 100,000 (for the lowest level) + 14,286 (which
is 1,000,000/70, since it is a dense index) + 205 (which is
14,286/70) + 3 (which is 205/70) + 1 (for the lone root) = 114,495
blocks; all divisions are rounded up. Your answers will be
different if you assumed 69 pointers per block (leaving one pointer
for some record keeping). The number of disk accesses required = 5.
(The sketch after part (4) below reproduces these counts.)
- The index is still dense, so nothing changes here.
- No. of blocks = 100,000 (for the lowest level) + 1,429 (which is
100,000/70, since the first level is now sparse, with one entry per
data block) + 21 (1,429/70) + 1 (for the root) = 101,451 blocks.
The number of disk accesses = 4.
- No. of blocks = 142,858 (for the lowest level, 1,000,000/7) +
2,041 (which is 142,858/70) + 30 (2,041/70) + 1 (for the root) =
144,930 blocks. The number of disk accesses = 4.
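These totals are easy to check mechanically. The sketch below
reproduces them; the figures of 1,000,000 records, 10 records per
data block in parts (1) and (3), and 70 entries per index block are
inferred from the numbers above, so treat the reconstruction as an
assumption, not as the graded formula.

    import math

    # Reconstruction of the arithmetic above: given N records, records
    # per data block, and index entries per index block, compute the
    # data blocks and each index level (first level dense or sparse),
    # rounding up at every level until a single root block remains.
    def index_blocks(n_records, recs_per_block, fanout=70, dense=True):
        data = math.ceil(n_records / recs_per_block)
        levels = [math.ceil((n_records if dense else data) / fanout)]
        while levels[-1] > 1:
            levels.append(math.ceil(levels[-1] / fanout))
        return data, levels

    for label, rpb, dense in [("part (1)", 10, True),
                              ("part (3)", 10, False),
                              ("part (4)", 7, False)]:
        data, levels = index_blocks(1_000_000, rpb, dense=dense)
        total = data + sum(levels)
        accesses = len(levels) + 1   # one read per index level + data
        print(f"{label}: {total} blocks, {accesses} accesses, {levels}")
    # part (1): 114495 blocks, 5 accesses, [14286, 205, 3, 1]
    # part (3): 101451 blocks, 4 accesses, [1429, 21, 1]
    # part (4): 144930 blocks, 4 accesses, [2041, 30, 1]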
- The first-level index can be sparse or dense (as in the
single-level case). But all higher levels *must* be sparse,
because it doesn't make sense to have two levels of dense indices;
it would only be a waste of pointers. In fact, the need for
multilevel indices was one of the motivating reasons for the use
of B-trees in database systems. You can think of the B-tree as a
multilevel index with a more sophisticated algorithmic
underpinning.
- An important goal of DBMS data organization is to "try to
organize blocks so that if we read a block, we use much of the
data inside that block" and don't have to go too many times to
"get other blocks".
One example of such a data organization is a "clustered file",
where records belonging to two different relations are actually
clustered together based on common access patterns. For example,
assume that queries such as "Find the office-number of the
department where Naren is an instructor" and "Find all the
instructors in the CS department" are frequent. Notice that both
queries involve two relations --- "departments" and "instructors".
If we know that such queries are common, we can take advantage of
this fact and store information pertaining to departments and
instructors together. Thus, we could have a clustered file with
interspersed records, where one department is followed by all its
instructors, then the next department, and so on. The goal is that
records that will be retrieved together are "clustered" in the
same block, so the retrieved blocks are more likely to contain all
the info you need. Note that, as a consequence, we would have a
secondary index on "Instructor Name" even though it is the primary
key of the Instructor relation (the file is no longer ordered by
instructor name)!
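Here is a toy picture of such an interspersed layout (the relation
contents and field names are invented purely for illustration):

    # Toy interspersed layout: each "block" holds one department
    # record followed by its instructor records. All names/contents
    # here are invented for illustration.
    departments = {"CS": "Office 101", "EE": "Office 202"}
    instructors = {"Naren": "CS", "Alice": "CS", "Bob": "EE"}

    # build the clustered file: a department, then its instructors
    blocks = []
    for dept, office in departments.items():
        block = [("dept", dept, office)]
        block += [("inst", name)
                  for name, d in instructors.items() if d == dept]
        blocks.append(block)

    # "Find the office-number of the department where Naren is an
    # instructor": once we reach Naren's block, the department record
    # we need sits in the SAME block, with no extra block reads.
    for block in blocks:
        if ("inst", "Naren") in block:
            print(block[0][2])       # -> Office 101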
The other obvious example is the "heap" organization, which
doesn't impose any ordering on the individual records. In other
words, records are just jumbled up together in the order in which
they arrived (presumably). Now, what use is such a data structure?
The only real advantage comes when you are reading the whole
relation (the table scan operation, again!), in no particular
order. Thus, heaps provide both storage efficiency and fast table
scans. Insertions are easier than in other data structures,
because they don't involve any "searching" before inserting (see
the sketch at the end of this answer). Of course, heaps are pretty
inefficient for searches, deletions, etc.
If your answer was hash tables, it is a moot point whether they
satisfy the requirements of the question (it depends on what the
hashing function is and what your assumptions are).
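To make the heap tradeoff concrete, here is a toy sketch (purely
illustrative; a real heap file manages fixed-size blocks and free
space):

    # Heap-file caricature: O(1) insertion (just append at the end),
    # fast full scans, but every search touches every record.
    class HeapFile:
        def __init__(self):
            self.records = []

        def insert(self, rec):
            self.records.append(rec)    # no search, no ordering

        def scan(self):
            return iter(self.records)   # the one fast read path

        def find(self, pred):
            # searching degenerates into a full scan
            return [r for r in self.records if pred(r)]

    h = HeapFile()
    for rec in [("Bob", 7), ("Alice", 3), ("Carol", 9)]:
        h.insert(rec)
    print(h.find(lambda r: r[1] > 5))   # [('Bob', 7), ('Carol', 9)]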