CMSC724: Database Management Systems, Fall 2024

Aug 27, Aug 29: Introduction/Overview, and Background

In the first two classes, we will cover:
- History of database management systems, what problems they are trying to solve, the state of the art, and what we plan to cover this semester.
- Relational Model, SQL, and how traditional relational database management systems are built (Architecture paper below).
- How to think broadly about the world of data management systems today, and how to place different works in context. Specifically, we will discuss the key design decisions that need to be made for a data management system (data models, languages/programming frameworks, storage models, transaction support, guarantees, etc.), and what the impact of those design decisions might be. Most of this discussion will be at a high level, and we will dive deeper into the specifics of a few of those topics later in the course. We will also talk about some orthogonal but important concerns, such as streaming systems and immutability.
Slides/Notes: [1-Introduction], [424 Summary], [424 ALL Slides] (419-page PDF).
Main readings:
1. "Architecture of a Database System"; Joe Hellerstein, Mike Stonebraker, James Hamilton; Foundations and Trends 2007. (original paper; a crop-merged version of that PDF): An overview of how traditional database management systems are built, and some of the key design considerations.
Additional Readings:
1. Database System Concepts; Avi Silberschatz, Henry F. Korth, S. Sudarshan: This is the textbook for the undergraduate class and may be a good reference to brush up on any background that we don't cover in depth in class. [link]
2. For a nice historical overview of database management systems, see the first paper ("Evolution of ..") in this ACM Computing Surveys, March 1976
3. Concurrency Control and Recovery; Mike Franklin, 1997 [pdf link]: We won't cover this topic much in this class, but this paper provides a nice introduction to it that you should be comfortable with.
4. Is your database relational? Ted Codd Wikipedia article

Sept 3-5-10-12: Data Models, Languages, and Programming Frameworks

Slides/Notes: [2-Models-Languages-Abstractions]
We will dive into the different data models, query languages, and programming frameworks that have been proposed over the years, for modeling and structuring data, and for querying and analyzing it. In particular, we will discuss: Relational Model and SQL, Document data model (e.g., MongoDB), Entity-relationship Model and ORM frameworks, Map-Reduce and Spark, Graph data models, REST, OLAP, Visualization, Machine Learning Frameworks (in varying levels of detail). We will also discuss the importance of "schemas" and the challenges of "schema evolution".
Main Readings:
1. "What goes around comes around"; Mike Stonebraker and Joe Hellerstein; Redbook: This paper summarizes how "data models" evolved over the last 50 years.
2. A Survey of Research on Deductive Database Systems (Sections 1-4); Ramakrishnan and Ullman; 1993 (http://ilpubs.stanford.edu:8090/80/)
3. Declarative Networking: Language, Execution and Optimization (Sections 1, 2, and 4); Loo et al.; SIGMOD 2006.
4. MapReduce: A Flexible Data Processing Tool (Sections 1-2); Jeffrey Dean and Sanjay Ghemawat; CACM 2010
5. Resilient Distributed Datasets (Sections 1-4); Zaharia et al.; NSDI 2012.
6. SystemML: Declarative Machine Learning On MapReduce (Sections I-III(A)); Ghoting et al.; ICDE 2011.
7. GraphX: Graph Processing in a Distributed Dataflow Framework (Sections 1-3); OSDI 2014 ([link])
Presentation Readings (September 12):
Optional Readings:
1. Database System Concepts; Avi Silberschatz, Henry F. Korth, S. Sudarshan. Two Appendixes covering network model and hierarchical model in detail are available on the book webpage (for the 6th edition). [link]
2. Joachim W. Schmidt. Some High Level Language Constructs for Data of Type Relation. ACM Transactions on Database Systems, 2(3), 1977, 247-261.
3. MapReduce and Parallel DBMSs: Friends or Foes? Stonebraker et al.; CACM 2010

Sept 17-19: No Class

No classes this week. Use this time to work on the first set of assignments.

Sept 24-26, Oct 1-3: Storage Models and Indexing

Slides/Notes: [3-Storage]
We will discuss how the data is stored on disks and in memory, and the impact of those design decisions. Specifically, we will discuss row and column storage formats for databases, and the prevalent storage formats for data lakes that are widely used today.
Main Readings:
1. Weaving Relations for Cache Performance; Ailamaki et al.; VLDB 2001.
2. Integrating compression and execution in column-oriented database systems; Abadi et al.; SIGMOD 2006.
3. Dremel: interactive analysis of web-scale datasets; Melnik et al.; VLDB 2010.
4. Delta lake: high-performance ACID table storage over cloud object stores; Armbrust et al.; VLDB 2020.
Presentation Readings (Oct 3):

Oct 8-10-15-17: Query Processing and Optimization - I

Slides/Notes: [4-Query-Part-1] [4-Query-Part-2]
We will go deeper into how queries are executed and optimized, with a focus on more recent techniques. We will cover these topics for traditional relational database management systems, modern data warehouses, and data lakes.
Main readings:
1. Goetz Graefe: Query Evaluation Techniques for Large Databases (Sections 1 and 2). ACM Comput. Surv. 25(2): 73-170 (1993) [link]
2. Practical Skew Handling in Parallel Joins; VLDB 1992
3. P. Boncz, et al., MonetDB/X100: Hyper-Pipelining Query Execution; CIDR, 2005
4. Efficiently Compiling Efficient Query Plans for Modern Hardware; VLDB 2011 (you can skim Sections 4-)
Presentation Readings (Oct 17):

Oct 22-24-29-31: Query Processing and Optimization - II

Slides/Notes: See above.
We will go deeper into how queries are executed and optimized, with a focus on more recent techniques. We will cover these topics for traditional relational database management systems, modern data warehouses, and data lakes.
Main readings:
1. Surajit Chaudhuri: An Overview of Query Optimization in Relational Systems. PODS 1998: 34-43;
2. How good are query optimizers really?; VLDB 2015
3. Outerjoin Simplification and Reordering for Query Optimization (Sections 1-3); TODS 1997
4. Extensible/Rule Based Query Rewrite Optimization in Starburst; SIGMOD 1992
5. Execution Strategies for SQL Subqueries; SIGMOD 2007
Presentation Readings (Oct 31):
1. Bao: Making learned query optimization practical; SIGMOD 2021.
2. Bitvector-aware Query Optimization for Decision Support Queries; SIGMOD 2020.
3. NeuroCard: one cardinality estimator for all tables; PVLDB 2021.

Nov 5-7-12-14: Query Processing and Optimization - III

Slides/Notes: See above.
We will go deeper into how queries are executed and optimized, with a focus on more recent techniques. We will cover these topics for traditional relational database management systems, modern data warehouses, and data lakes.
Main readings:
1. Ron Avnur and Joseph M. Hellerstein. Eddies: Continuously Adaptive Query Processing. SIGMOD, 2000.
2. Volker Markl, Vijayshankar Raman, David Simmen, Guy Lohman, Hamid Pirahesh, Miso Cilimdzic. Robust Query Processing Through Progressive Optimization. SIGMOD, 2004.
3. Permutable compiled queries: dynamically adapting compiled queries without recompiling; VLDB 2018.
4. Skew Strikes Back: New Developments in the Theory of Join Algorithms; SIGMOD Record 2013
5. Optimizing imperative functions in relational databases with Froid; VLDB 2018.
Presentation Readings (Nov 14):
1. Adopting worst-case optimal joins in relational database systems; VLDB 2020.
2. Simple Adaptive Query Processing vs. Learned Query Optimizers: Observations and Analysis; VLDB 2023.
3. Functional-Style SQL UDFs With a Capital ‘F’; SIGMOD 2020.

Nov 19-21, Dec 3-5: Data Streams and Dataflow Engines

Slides/Notes: [5-Streams]
We will spend a few classes discussing topics in real-time data processing.
Main readings:
1. Maintenance of materialized views: Problems, techniques, and applications; A Gupta, IS Mumick - IEEE Data Eng. Bull., 1995
2. Models and issues in data stream systems; PODS 2002.
3. Discretized streams: fault-tolerant streaming computation at scale; SOSP 2013
4. Watermarks in Stream Processing Systems: Semantics and Comparative Analysis of Apache Flink and Google Cloud Dataflow; VLDB 2021.
5. Incremental, Iterative Data Processing with Timely Dataflow; CACM 2016.
6. DBSP: Automatic incremental view maintenance for rich query languages; VLDB 2023.
Presentation Readings:
1. (December 3) Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm; EDBT 2013.
2. (December 3) KSQL: Streaming SQL Engine for Apache Kafka; EDBT 2019 (https://openproceedings.org/2019/conf/edbt/EDBT19_paper_329.pdf)
3. (December 5) Vortex: A Stream-oriented Storage Engine For Big Data Analytics; SIGMOD 2024.
4. (December 5) Durable Functions: Semantics for Stateful Serverless; OOPSLA, 2021.

Prof. Amol Deshpande; CSI 1121; Tue-Thu 11:00am-12:15pm


Schedule and Readings