Schedule and Readings
**This is a tentative schedule with relevant readings listed; more details will be filled in over the first two weeks of the semester. Only a few of these readings (1-2 per class) will be required.**
- Weeks 1 and 2: Background
- Brief Description: A brief overview of the key components of a traditional database system and of newer data management systems (specifically, Apache Spark), followed by prior work on auto-tuning.
- Slides/Notes: [Introduction], [Background], [AutoAdmin]
- Readings - Jan 31, 2023:
- (Required) Resilient Distributed Datasets; Zaharia et al.; NSDI 2012.
- (Optional) "Architecture of a Database System"; Joe Hellerstein, Mike Stonebraker, James Hamilton; Foundations and Trends 2007. (original paper; a crop-merged version of that PDF): An overview of how traditional database management systems are built and of some of the key design considerations.
- Readings - Feb 2, 2023:
- (Required) Self-Tuning Database Systems: A Decade of Progress (VLDB 2007)
- (Optional) Make Your Database System Dream of Electric Sheep: Towards Self-Driving Operation (VLDB 2021 from Andy Pavlo)
- Relevant Readings:
- Towards instance-optimized data systems (VLDB 2021 from Tim Kraska)
- AI Meets Database: AI4DB and DB4AI (SIGMOD 2021)
- openGauss: An Autonomous Database System (VLDB 2021 from Guoliang Li)
- Weeks 3, 4, 5: Learned Indexes, Storage Layouts
- Brief Description: Learned indexes are a new class of indexes that use machine learning algorithms to improve the performance and functionality of "search". This topic has probably seen the most work in recent years, because the problems can be defined relatively cleanly and indexes have fewer complex interactions with, and dependencies on, other modules in the system.
- Slides/Notes: [Learned Indexes], [QD-Trees; Bloom Filters], [Learned LSMs], [AI meets AI; Multi-d Indexes], [Materialized Views], [Poisoning Attacks], [Query Processing/Optimization Background]
- Readings - Feb 7, 2023:
- (Required) The Case for Learned Index Structures. SIGMOD 2018. Tim Kraska, Alex Beutel, Ed Chi, Jeffrey Dean, Neoklis Polyzotis
- Readings - Feb 9, 2023:
- (Required) Qd-tree: Learning data layouts for big data analytics; SIGMOD 2020.
- (Required) Meta-Learning Neural Bloom Filters; ICML, 2019.
- Readings - Feb 14, 2023:
- (Required) From WiscKey to Bourbon: A Learned Index for Log-Structured Merge Trees (OSDI 2020)
- Readings - Feb 16, 2023:
- (Required) AI Meets AI: Leveraging Query Executions to Improve Index Recommendations (SIGMOD 2019)
- (Required) Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads; PVLDB 2021
- Readings - Feb 21, 2023:
- (Required) H. Yuan, G. Li, L. Feng, et al. Automatic view generation with deep learning and reinforcement learning. In ICDE, 2020.
- Readings - Feb 23, 2023:
- The Price of Tailoring the Index to Your Data: Poisoning Attacks on Learned Index Structures; SIGMOD 2022
- Relevant Readings:
- ALEX: An Updatable Adaptive Learned Index. SIGMOD 2020. Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, David Lomet, Tim Kraska
- Learning Multi-dimensional Indexes. SIGMOD 2020. Vikram Nathan, Jialin Ding, Mohammad Alizadeh, Tim Kraska
- Stacked Filters: Learning to Filter by Structure (VLDB 2021)
- Stable Learned Bloom Filters for Data Streams (VLDB 2020)
- Tiresias: Enabling Predictive Autonomous Storage and Indexing (VLDB 2022)
- Benjamin Hilprecht, Carsten Binnig, Uwe Röhm. Towards learning a partitioning advisor with deep reinforcement learning. SIGMOD 2019.
- Weeks 6-9: Query Processing, Query Optimization
- Slides/Notes: [Sorting; Joins], [Eddies], [UCB; UCT; SkinnerDB], [AQP; Cardinality Estimation 1]
- Brief Description: We will cover the work on adaptive query processing operators and adaptive execution engines. There is less work on "Query Processing" than on the other topics, likely because the overheads are still too high. "Query Optimization", in contrast, has seen quite a bit of work in the last few years, although the jury is still out on how well the learned techniques work. This is definitely an area where the potential benefits of incorporating ML techniques are quite high.
- Readings - Feb 28, 2023:
- (Required) The Case for a Learned Sorting Algorithm. SIGMOD 2020. Ani Kristo, Kapil Vaidya, Ugur Cetintemel, Sanchit Misra, Tim Kraska
- (Required) The Case for Learned In-Memory Joins; Sabek and Kraska; VLDB 2022 (arXiv 2021)
- Readings - March 2, 2023:
- (Required) Eddies: Continuously adaptive query processing; (SIGMOD 2000)
- Readings - March 7, 2023:
- (Required) Trummer, I., Wang, J., Maram, D., Moseley, S., Jo, S., Antonakakis, J. SkinnerDB: Regret-Bounded Query Evaluation via Reinforcement Learning. SIGMOD, 2019.
- Readings - March 9, 2023:
- (Required) Learned Approximate Query Processing: Make it Light, Accurate and Fast (CIDR 2021)
- (Required) Deep Unsupervised Cardinality Estimation (VLDB 2019)
- Readings - March 14, 2023:
- (Required) Zongheng Yang, Amog Kamsetty, Sifei Luan, Eric Liang, Yan Duan, Xi Chen, and Ion Stoica. NeuroCard: One Cardinality Estimator for All Tables. PVLDB, 14(1): 61-73, 2021
- Readings - March 16, 2023:
- (Required) Learning to optimize join queries with deep reinforcement learning; SIGMOD 2018.
- Readings - March 28, 2023:
- (Required) Bao: Learning to Steer Query Optimizers. Preprint. Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, Tim Kraska
- Readings - March 30, 2023:
- (Required) Cost-based or Learning-based? A Hybrid Query Optimizer for Query Plan Selection (VLDB 2022)
- Relevant Readings:
- A. Dutt, C. Wang, A. Nazi, S. Kandula, V. R. Narasayya, and S. Chaudhuri. Selectivity estimation for range predicates using lightweight models. PVLDB, 12(9):1044–1057, 2019.
- A Learned Query Rewrite System using Monte Carlo Tree Search (VLDB 2022)
- Are We Ready For Learned Cardinality Estimation? (VLDB 2021)
- Learned Cardinality Estimation: An In-depth Study (SIGMOD 2022)
- An End-to-End Learning-based Cost Estimator (VLDB 2019)
- Weeks 10, 11: Natural Language to SQL
- Brief Description: Generating SQL from natural language text. This is a topic where LLMs appear to do quite well out of the box, but there is also rich prior literature on this topic.
- Readings - April 4, 2023:
- (Required) ATHENA: An Ontology-Driven System for Natural Language Querying over Relational Data Stores; VLDB 2016.
- Readings - April 6, 2023:
- (Required) Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning; ArXiv 2017.
- (Required) SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning; ArXiv 2017.
- Readings - April 11, 2023:
- (Required) RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers; 2019.
- Readings - April 13, 2023:
- (Required) RYANSQL: Recursively Applying Sketch-based Slot Fillings for Complex Text-to-SQL in Cross-Domain Databases; 2020.
- Readings - April 18, 2023:
- (Required) Natural language to SQL: Where are we today?; VLDB 2020.
- Relevant Readings:
- Natural language to SQL: Where are we today? (VLDB 2020)
- LearnedSQLGen: Constraint-aware SQL Generation using Reinforcement Learning (SIGMOD 2022)
- CodexDB: Generating Code for Processing SQL Queries using GPT-3 Codex (ArXiv 2022)
- BERT Meets Relational DB: Contextual Representations of Relational Databases
- ATHENA++: Natural Language Querying for Complex Nested SQL Queries; VLDB 2020
- Pre-trained Models and Table-oriented Tasks
- Readings - April 20, 2023:
- (Required) TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data; ArXiv 2020.
- Readings - April 25, 2023:
- (Required) Deep Entity Matching with Pre-Trained Language Models; VLDB 2021.
- Readings - April 27, 2023:
- (Required) TCN: Table Convolutional Network for Web Table Interpretation; WWW 2021.
- Readings - May 2, 2023:
- (Required) TUTA: Tree-based Transformers for Generally Structured Table Pre-training; KDD 2021.
- (Required) Annotating Columns with Pre-trained Language Models; SIGMOD 2022.
- Readings - May 4, 2023:
- (Required) DeepJoin: Joinable Table Discovery with Pre-trained Language Models;
- Readings - May 9, 2023:
- Integrating Data Lake Tables; VLDB 2022.
- Readings - May 11, 2023:
- (Required) Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes; CIDR 2023.
- (Required) Can Foundation Models Wrangle Your Data?; ArXiv 2022.
- Miscellaneous
- Relevant Readings:
- Dana Van Aken, Andrew Pavlo, et al. Automatic Database Management System Tuning Through Large-scale Machine Learning. In SIGMOD, 2017.
- Ji Zhang, Yu Liu, Ke Zhou, Guoliang Li et al. An End-to-End Automatic Cloud Database Tuning System Using Deep Reinforcement Learning. SIGMOD 2019.
- Learning Scheduling Algorithms for Data Processing Clusters (SIGCOMM 2019)
- Learning to dispatch for job shop scheduling via deep reinforcement learning (NeurIPS)
- J. Tan, T. Zhang, F. Li, et al. iBTune: Individualized Buffer Tuning for Large-Scale Cloud Databases. VLDB 2019.
- Mayuresh Kunjir, Shivnath Babu. Black or White? How to Develop an AutoTuner for Memory-based Analytics. SIGMOD 2020.