CMSC848E: Machine Learning for Data Management Systems

Amol Deshpande; Tue-Thu 12:30pm-1:45pm




Description:

Machine learning techniques have been used to improve specific components of data management systems for a couple of decades now, but that trend has really accelerated in the last 5-6 years for several reasons.
  • Although database systems have always had many tuning knobs, it was usually possible to reason about them (and set them) manually. However, the newer systems built over the last 20 years are more complex and are often distributed and/or cloud-based, and their running environments change frequently, making manual tuning harder.
  • Data characteristics are changing more rapidly, and queries/analytics are often run on data without any pre-analysis to compute statistics. The query/analytics workloads are also more unpredictable today.
  • ML training/inference is faster, by orders of magnitude in many cases, through use of specialized hardware, making it feasible to use these techniques where it didn't make sense in the past.
  • The success of newer ML models like Large Language Models (LLMs) has opened up possibilities to improve many additional aspects of data management, e.g., natural language querying and even database design.
The goal of this class is to better understand the recent literature on using ML to improve data management systems (broadly defined), with an emphasis on deep learning-based approaches, and in particular on where they may succeed as well as their limitations/failure modes. A tentative list of topics that we will cover (expect some changes over the semester):
  • Background on database systems and deep learning (≈ 2 weeks): A brief overview of the key components of a traditional database system, and newer data management systems (specifically, Apache Spark). Prior work on auto-tuning. A brief overview of deep neural networks and reinforcement learning. See pre-requisites below.
  • Learned Indexes and Storage Layouts (≈ 2-3 weeks): Learned indexes are a new class of indexes that use machine learning algorithms to improve the performance and functionality of "search". This topic has probably seen the most work in recent years, as the problems can be defined cleanly (relatively speaking) and indexes have less complex interactions with/dependencies on other modules in the system. We will also cover some of the work on optimizing storage layouts.
  • Query Processing (≈ 2-3 weeks): We will cover the work on adaptive query processing operators and adaptive execution engines. There is less work on this topic relative to the others, likely because the overheads are still too high.
  • Query Optimization (≈ 2-3 weeks): This topic has seen quite a bit of work in the last few years, but the jury is still out on how well the proposed approaches work. This is definitely an area where the benefits of incorporating ML techniques are quite high.
  • Natural Language to SQL (≈ 2-3 weeks): Generating SQL from natural language text. This is a topic where LLMs appear to do quite well out of the box, but there is also rich prior literature on this topic.
  • Workload Forecasting and Resource Management (≈ 2-3 weeks): We will cover some of the recent work on forecasting workloads and using it to tune various knobs in a system.
You are welcome to focus on other topics/components for your projects/assignments (see below) as long as the topic has a significant data management systems component (e.g., data lakes).

Pre-requisites: Although we will spend some time reviewing background on data management systems as well as deep learning, it is not possible for us to cover that background in depth. If you are not already familiar with those topics, you may have to spend additional time catching up on the background. We will primarily be using ML techniques as black boxes, so a deep understanding of the foundations is not required (we will largely cover ML techniques as we encounter them in the papers, rather than covering them at the beginning). For databases, outside of the "query optimization" topic, it should be relatively easy to get up to speed on the required background. However, you should at least be familiar with SQL and have used it in the past. You can go through CMSC424 assignments (CMSC 424 Fall 2022) if you want to refresh your knowledge.

Course Grading:

Grading will be based on class participation + paper summaries (20%), assignments (30%), a take-home final (20%), and a class project (30%). More details are on the Assignments tab.

Office Hours:

By appointment.

Approach:

This is a research-oriented seminar course, and will be based on reading and discussing papers from recent conferences.

We will not be recording the lectures since they are intended to be discussion-heavy, but I will try to post notes after every class.

The course counts as a PhD and MS qualifying course in Databases.

Class forum:

We will use Slack for class communications and discussions: Link to Join.