CS 5614 Homework #1
Solution Sketches
- The reason the record-based network model did not allow
many-many relationships is that the original designers wanted
to avoid the use of variable-length records. A many-one
relationship, on the other hand, can be implemented in a
circular fashion using only one pointer per record. Hence, all
records (of a given type) have the same length, which in turn
allows a more efficient file organization.
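To make this concrete, here is a minimal sketch of the classic
one-pointer ring (class and field names are illustrative, not part
of the assignment): every member record carries exactly one "next"
pointer, and the ring closes back at its owner.

    # One-pointer ring for a many-one relationship: many Member
    # records per Owner. Since each record holds exactly ONE pointer,
    # all records of a given type stay fixed-length.
    class Owner:
        def __init__(self, key):
            self.key = key
            self.nxt = self          # an empty ring points to itself

    class Member:
        def __init__(self, key):
            self.key = key
            self.nxt = None          # the single pointer per record

    def insert(owner, member):
        # splice the new member into the ring, right after the owner
        member.nxt = owner.nxt
        owner.nxt = member

    def members_of(owner):
        # walk the ring until we come back around to the owner
        rec = owner.nxt
        while rec is not owner:
            yield rec
            rec = rec.nxt

    dept = Owner("CS")
    for name in ("a", "b", "c"):
        insert(dept, Member(name))
    print([m.key for m in members_of(dept)])   # ['c', 'b', 'a']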
- The key issue is to first notice why the question talks about
secondary indices vs. table scans, as opposed to primary indices
vs. table scans. The reason is that by saying "secondary index",
we are secretly implying a dense index! :-)
Part a: If each block contains 22 records, we will require about
90,910 blocks. The query returns 75,000 records, so the ratio of
answer records to blocks is 75,000/90,910 > 82%; a typical block
is therefore quite likely to contain an answer record. While the
secondary index must perform a random access for (almost) every
record in the answer, possibly revisiting the same block several
times, the table scan reads each block exactly once, sequentially.
Thus, the latter is better.
Part b: In contrast, if each block holds only 2 records, we will
require about a million blocks. The query will thus touch at most
7.5% of the total blocks. Hence, a secondary index will be very
fruitful.
Moral of the Story: Use a secondary index when the average query
that uses the index retrieves far fewer records than there are
blocks (thus, you need both a large number of records and high
discrimination to justify the implementation of a secondary
index).
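As a quick sanity check, here is a back-of-the-envelope sketch of
both parts. The total of 2,000,000 records is inferred from the
block counts above (90,910 * 22 and 1,000,000 * 2), so treat it as
an assumption rather than part of the problem statement.

    import math

    # Back-of-the-envelope check of parts (a) and (b). The total of
    # 2,000,000 records is inferred from the block counts quoted
    # above, i.e., it is an assumption, not given in the problem.
    N_RECORDS = 2_000_000
    ANSWERS = 75_000

    for recs_per_block in (22, 2):
        blocks = math.ceil(N_RECORDS / recs_per_block)
        # the index does one access per answer record, so it touches
        # at most this fraction of the blocks (a scan touches 100%):
        frac = min(ANSWERS, blocks) / blocks
        print(f"{recs_per_block} recs/block: {blocks} blocks, "
              f"index touches up to {frac:.1%} of them")
    # -> 22 recs/block: 90910 blocks, index touches up to 82.5% of them
    # ->  2 recs/block: 1000000 blocks, index touches up to 7.5% of them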
- This question seems to have been the most confusing. :-) If your
answer doesn't closely match what we have here, don't panic! All we
are looking for is whether you made reasonable assumptions and
whether the block counts in your answers have the same relative
ordering w.r.t. the ones here. So, for example, part (4) should
have the highest number of blocks and use 4 disk accesses, and so on.
- No. of blocks = 100,000 (for the lowest level) + 14,286 (which
is 1,000,000/70, since it is a dense index) + 205 (which is
14,286/70) + 3 (which is 205/70) + 1 (for the lone root) = 114,495
blocks; all divisions are rounded up. Your answers will be
different if you assumed 69 pointers per block (leaving one pointer
for some record keeping). The number of disk accesses required = 5.
(The sketch after part (4) below reproduces these counts.)
- The index is still dense, so nothing changes here.
- No. of blocks = 100,000 (for the lowest level) + 1,429 (which is
100,000/70, since the first level is now sparse, with one entry per
data block) + 21 (1,429/70) + 1 (for the root) = 101,451 blocks.
The number of disk accesses = 4.
- No. of blocks = 142,858 (for the lowest level, 1,000,000/7) +
2,041 (which is 142,858/70) + 30 (2,041/70) + 1 (for the root) =
144,930 blocks. The number of disk accesses = 4.
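These totals are easy to check mechanically. The sketch below
reproduces them; the figures of 1,000,000 records, 10 records per
data block in parts (1) and (3), and 70 entries per index block are
inferred from the numbers above, so treat the reconstruction as an
assumption, not as the graded formula.

    import math

    # Reconstruction of the arithmetic above: given N records, records
    # per data block, and index entries per index block, compute the
    # data blocks and each index level (first level dense or sparse),
    # rounding up at every level until a single root block remains.
    def index_blocks(n_records, recs_per_block, fanout=70, dense=True):
        data = math.ceil(n_records / recs_per_block)
        levels = [math.ceil((n_records if dense else data) / fanout)]
        while levels[-1] > 1:
            levels.append(math.ceil(levels[-1] / fanout))
        return data, levels

    for label, rpb, dense in [("part (1)", 10, True),
                              ("part (3)", 10, False),
                              ("part (4)", 7, False)]:
        data, levels = index_blocks(1_000_000, rpb, dense=dense)
        total = data + sum(levels)
        accesses = len(levels) + 1   # one read per index level + data
        print(f"{label}: {total} blocks, {accesses} accesses, {levels}")
    # part (1): 114495 blocks, 5 accesses, [14286, 205, 3, 1]
    # part (3): 101451 blocks, 4 accesses, [1429, 21, 1]
    # part (4): 144930 blocks, 4 accesses, [2041, 30, 1]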
- The first-level index can be sparse or dense (as in the
single-level case). But all higher levels *must* be sparse,
because it doesn't make sense to have two levels of dense indices;
it would only be a waste of pointers. In fact, the need for
multilevel indices was one of the motivating reasons for the use
of B-trees in database systems. You can think of the B-tree as a
multilevel index with a more sophisticated algorithmic
underpinning.
- An important goal of DBMS data organization is to "try to
organize blocks so that if we read a block, we use much of the
data inside that block" and don't have to go too many times to
"get other blocks".
One example of such a data organization is a "clustered file",
where records belonging to two different relations are actually
clustered together based on common access patterns. For example,
assume that queries such as "Find the office-number of the
department where Naren is an instructor" and "Find all the
instructors in the CS department" are frequent. Notice that both
queries involve two relations --- "departments" and "instructors".
If we know that such queries are common, we can take advantage of
this fact and store information pertaining to departments and
instructors together. Thus, we could have a clustered file with
interspersed records, where one department is followed by all its
instructors, then the next department, and so on. The goal is that
records that will be retrieved together are "clustered" in the
same block, so the retrieved blocks are more likely to contain all
the info you need. Note that, as a consequence, we would have a
secondary index on "Instructor Name" even though it is the primary
key of the Instructor relation (the file is no longer ordered by
instructor name)!
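Here is a toy picture of such an interspersed layout (the relation
contents and field names are invented purely for illustration):

    # Toy interspersed layout: each "block" holds one department
    # record followed by its instructor records. All names/contents
    # here are invented for illustration.
    departments = {"CS": "Office 101", "EE": "Office 202"}
    instructors = {"Naren": "CS", "Alice": "CS", "Bob": "EE"}

    # build the clustered file: a department, then its instructors
    blocks = []
    for dept, office in departments.items():
        block = [("dept", dept, office)]
        block += [("inst", name)
                  for name, d in instructors.items() if d == dept]
        blocks.append(block)

    # "Find the office-number of the department where Naren is an
    # instructor": once we reach Naren's block, the department record
    # we need sits in the SAME block, with no extra block reads.
    for block in blocks:
        if ("inst", "Naren") in block:
            print(block[0][2])       # -> Office 101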
The other obvious example is the "heap" organization, which
doesn't impose any ordering on the individual records. In other
words, records are just jumbled up together in the order in which
they arrived (presumably). Now, what use is such a data structure?
The only real advantage comes when you are reading the whole
relation (the table scan operation, again!), in no particular
order. Thus, heaps provide both storage efficiency and fast table
scans. Insertions are easier than in other data structures,
because they don't involve any "searching" before inserting (see
the sketch at the end of this answer). Of course, heaps are pretty
inefficient for searches, deletions, etc.
If your answer was hash tables, it is a moot point whether they
satisfy the requirements of the question (it depends on what the
hashing function is and what your assumptions are).
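To make the heap tradeoff concrete, here is a toy sketch (purely
illustrative; a real heap file manages fixed-size blocks and free
space):

    # Heap-file caricature: O(1) insertion (just append at the end),
    # fast full scans, but every search touches every record.
    class HeapFile:
        def __init__(self):
            self.records = []

        def insert(self, rec):
            self.records.append(rec)    # no search, no ordering

        def scan(self):
            return iter(self.records)   # the one fast read path

        def find(self, pred):
            # searching degenerates into a full scan
            return [r for r in self.records if pred(r)]

    h = HeapFile()
    for rec in [("Bob", 7), ("Alice", 3), ("Carol", 9)]:
        h.insert(rec)
    print(h.find(lambda r: r[1] > 5))   # [('Bob', 7), ('Carol', 9)]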