CMPUT 692-A1

Topics in Data Management with LLMs

Fall 2025

Meetings: Mon & Wed, 11:00–12:20
Instructor: Davood Rafiei, UCOMM 7-130

Large Language Models (LLMs) are increasingly applied to tasks that were once labor-intensive or difficult to automate. This course explores their emerging role in addressing core challenges in data management. We begin with foundations in databases and LLMs, then study their points of intersection—focusing on models, algorithms, and systems that enable scalable, practical solutions for managing large datasets and high-volume workloads.

Topics to be Covered (Tentative)

DB: Database Foundations

Query languages: First-order logic, conjunctive queries, SQL
Data integration: Models and techniques

LLM: Large Language Models

LLM Basics: Prompting and reasoning methods (chain-of-thought, self-consistency, tree-of-thought) and retrieval-augmented generation (RAG)
Foundations: Probabilistic/statistical view, including n-gram models
Reinforcement Learning: Preference optimization (RLHF, RLAIF) and improved reasoning
Agents & Verticals: Tool use, planning, and domain-specific applications

DB & LLM: Natural Language Interfaces to Databases

Text-to-SQL generation
Benchmarking and evaluation (e.g., Spider, BIRD)
Table-based Question Answering (Table-QA)
Data-to-Text generation

DB & LLM: Data Integration with LLMs

Data wrangling and cleaning
Program synthesis for table transformations
Entity resolution and matching

DB: Scaling Retrieval

Similarity search over structured & unstructured data
Models of relevance: lexical and semantic
Top-k query processing

DB: Example-Based Queries

Querying by examples: methods, applications, limitations

Course Prerequisites

Students are expected to have a background in introductory data management and/or information retrieval (e.g., CMPUT 291 or equivalent) or be willing to learn these fundamentals. They should also have some knowledge of probability and statistics, along with demonstrated programming proficiency. Programming experience should include working with common data formats (e.g., JSON, CSV) and data analysis libraries, as well as familiarity with writing scripts for tasks such as LLM inference and basic model training.

Grading (Tentative)

36% — Assignments: problem sets, programming exercises, and research paper reviews
44% — Term project (individual or groups of 2, depending on class size)
15% — Class presentation of a research paper
5% — Participation in class discussions

Recommended Books and Resources

Xiao, Zhu: Foundations of Large Language Models, arXiv:2501.09223, 2025.
Abiteboul, Hull, Vianu: Foundations of Databases, Addison-Wesley, 1995.
Li, Radev, Rafiei: Natural Language Interfaces to Databases, Springer Nature, 2023.
Leskovec, Rajaraman, Ullman: Mining of Massive Datasets, 3rd ed., Cambridge UP, 2020.
Relevant research papers (to be announced)