Monday, Aug 23, 1999 ------------------------ - Introduction to CS5614 - Logistics, Recommended Texts, Web page, etc. - Brief Historical Background of DBs (introduced to motivate the various aspects) Here's a synopsis: The first two entries are switched to better illustrate the sequence of research ideas 1973: Charles W. Bachman wins Turing Award - conceptualized that information can be stored and manip. - propounded the information-centric view and computer as a tool to manipulate information - designed networked databases, which along with hierarchical databases, were the precursors to RDBMSs - Architect behind the Honeywell IDS networkd DB - led to the formation of DBTG (Data Base Task Group) (the COBOL/CODASYL camp) 1970: Edgar F. Codd writes paper in CACM - proposes relational model - separate logical view from physical view - uses an algebra of operations to query/manipulate the DB 1975: ACM SIGMOD Meeting (The Great Debate) - Relational View: Set oriented queries and a good formal/mathematical model is important - Network View: Record oriented queries and efficiency is important; not possible to get efficiency with `R' From this point - 1980: Actual Development of RDBMSs; Two implementations - IBM, San Jose, CA - System R - Jim Gray, K.P. Eswaran (KPE) et.al. - Two-phase architecture - RDS: Relational Data System - RSS: Relational Storage System RSS provides support for links and images to allow backward compatibility with network DBs - RDS didn't use some of the facilities for RSS 'coz difficult to build optimizers for images etc. - KPE forms ESVAL, later formed basis of HP Database - Larry Ellison uses published specs to found ORACLE - IBM later produces DB2 and SQL/DS - UC Berkeley - Ingres - Michael Stonebraker, Wong, Held et.al. - used UNIX File System Support - Ingres taken over by Computer Associates - Rob Epstein helps build IDM and found Sybase 1981: Codd wins Turing Award 1989: Ingres group bags ACM Software Systems Award 1992: Stonebraker wins SIGMOD Innovations Award 1998: Gray wins Turing Award (Contributions to Transaction Processing) Gray is currently with Microsoft BARC - Relational Model - Basic Idea - Example of SQL - Implementation Decisions that shaped DBMSs - IBM's non-standard programmatic interface to SQL - Decision to allow duplicate tuples, leading to bag-semantics - Aspects of Studying DBMSs - Design - Programming - Implementation - CS5614 will cover all aspects! - Organizational Diagram of a DB: - (0) Data and Metadata - (i) Storage Management, - (ii) Query Optimizer, - (iii) Transcation Processor - Kinds of Memory Storage - Emphasis on Secondary Storage - cost - non-volatility - no addresability probs - Started on Query Optimizer ----------------------------------------------------- Acknowledgements and References: - Readings in DBs, Chapter 1 (The Roots), Stonebraker and Hellerstein, Eds. - Dr. Melli Annamalai, Personal Communication, ORACLE Corp. Wed, Aug 25, 1999 ------------------------ - Review of Relational Model, SQL - allows separation of logical organization from physical storage structure - enables declarative queries e.g. SELECT Name FROM Students WHERE age=30 - 3-Tiered Organization of DBs - Design of a Disk - Platters; magnetic material coated on surfaces - Spindle - Cylinders - Tracks - Sectors/Blocks (Indivisible Unit of Read/Write) - Access Time is a sum of three components - Seek Time: Time to locate the right cylinder and position the magnetic arm - Rotational Latency: Time taken to rotate the platter to position appropriate sector - Transfer Time: Actual time for operation - Total Time: Order of Milliseconds - Buffer Manager - Manages Main Memory and has replacement policies - Index (Indices) - A data strucuture that takes a property of a record as input and returns records that satisfy that property "quickly" - (to some extent) its choice dictates the actual physical data layout - Examples of Indices - Pointer (Cursor) Techniques (Originally used extensively in NDBs) - B-Trees - Hash Tables - Simple Example of Pointer Based Indices - Dense Indices - Sparse Indices - Definition of Sparse - At most one pointer from the data structure to each page/block/sector - Definition of Dense - One pointer to each record in the table - For indexing data that is sorted on disk - both dense and sparse are appropriate - referred to as primary index - Advantages of Dense - allows certain kinds of queries to be efficiently processed (range queries and pre-computing unions, intersections etc.) - Advantages of Sparse - more efficient use of blocks - For indexing data that is not already sorted on disk - only dense will do - referred to as secondary index - A primary index is one that enforces a particular clustering/layout of data - A secondary index is one that doesn't put any constraint on the data organization - Commercial Examples of Primary Indices - Sybase: Sparse - DB2: Dense - Oracle: Choice of Both - B-Trees - Generalization of Binary Search Tree - Balanced; All leaves at the same level - More than one child per node - Typically, one block (4096 bytes) can have as many as 340 pointers for integer valued keys (4 bytes) and pointers (8 bytes) - A typical B-tree does not have to extend beyond three levels - A typical B-tree does not have to extend beyond two levels - B-Tree allows efficient operations: - Complexity is typically measured in disk accesses, not just time - Insertion: O(h) - Deletion: O(h) - Splitting a Node: O(h) (why would we want to do this?) - Lookup: Theta(h) (i.e. it is both upper-bounded and lower-bounded by h) - h is the height of the tree - Time Complexity: Multiply the above figures by the number of keys/block (assuming linear search within a block) - Can be spedup even more if we use binary search within each block - Two ways to organize B-trees - Put data only at the lowest level (provides higher fanout) - Put data from the beginning (restricts fanout) - Moral of the Story from Storage Manager - "Algorithms and data structures for secondary storage are qualitatively different from those that operate on main memory" - Moral of the Story from Query Processor - "It is possible to be both efficient and declarative" - other corners of this spectrum: PROLOG, Network DBs Fri, Aug 27, 1999 ------------------------ - Review of ACID Properties - Techniques used to achieve each of the properties: A - Commitment C - Integrity constraints, active elements etc. I - Locks D - Logging - Locking - choice of record level locks vs. relation (page) level locks - tradeoffs, advantages, etc. - Examples of systems that use various levels of locks - 2PL: most commonly used technique of locking - Logging - performed on nonvolatile storage - WAL (Write Ahead Logging) - Active Database Elements - overcome passive nature of RDBMSs - Constraints - Triggers - Limitations of Active Elements - do not adequately support collaborative and workflow systems (that require richer domain semantics and multilevel elements) - inapplicable for networked, loosely coupled systems (with large periods of disconnect) - semantically very limited in expressiveness - Data Modeling - Use a design language: e.g. E/R Modeling - Introduction to E/R - Entity Sets and Relationships - Diagrammatic Notation: Rectangles, Ellipses, Diamonds and Arrows - Students, Courses, and a many-many relationship - many-one relationship - one-one relationship - multiway relationship - many-one, one-one as specializations of many-many