CMPUT 692-A1
Topics in Data Management with LLMs
Fall 2025
Meetings: Mon & Wed, 11:00–12:20
Instructor: Davood Rafiei, UCOMM 7-130
Large Language Models (LLMs) are increasingly applied to tasks that were once labor-intensive or difficult to automate. This course explores their emerging role in addressing core challenges in data management. We begin with foundations in databases and LLMs, then study their points of intersection—focusing on models, algorithms, and systems that enable scalable, practical solutions for managing large datasets and high-volume workloads.
Topics to be Covered (Tentative)
[DB] Database Foundations
- Query languages: First-order logic, conjunctive queries, SQL
- Data integration: Models and techniques
[LLM] Large Language Models
- LLM Basics: Reasoning methods (chain-of-thought, self-consistency, tree-of-thought) and retrieval-augmented generation (RAG)
- Foundations: Probabilistic/statistical view, including n-gram models
- Reinforcement Learning: Preference optimization (RLHF, RLAIF) and improved reasoning
- Agents & Verticals: Tool use, planning, and domain-specific applications
[DB & LLM] Natural Language Interfaces to Databases
- Text-to-SQL generation
- Benchmarking and evaluation (e.g., Spider, BIRD)
- Table-based Question Answering (Table-QA)
- Data-to-Text generation
[DB & LLM] Data Integration with LLMs
- Data wrangling and cleaning
- Program synthesis for table transformations
- Entity resolution and matching
[DB] Scaling Retrieval
- Similarity search over structured & unstructured data
- Models of relevance: lexical and semantic
- Top-k query processing
[DB] Example-Based Queries
- Querying by examples: methods, applications, limitations
Course Prerequisites
Students are expected to have a background in introductory data management and/or information retrieval (e.g., CMPUT 291 or equivalent) or be willing to learn these fundamentals. They should also have some knowledge of probability and statistics, along with demonstrated programming proficiency. Programming experience should include working with common data formats (e.g., JSON, CSV) and the tools and libraries that process them, as well as familiarity with scripting or coding for tasks such as LLM inference and basic model training.
Grading (Tentative)
- 36% — Assignments: problem sets, programming exercises, and research paper reviews
- 44% — Term project (individual or groups of 2, depending on class size)
- 15% — Class presentation of a research paper
- 5% — Participation in class discussions
Recommended Books and Resources
- Xiao, Zhu: Foundations of Large Language Models, arXiv:2501.09223, 2025.
- Abiteboul, Hull, Vianu: Foundations of Databases, Addison-Wesley, 1995.
- Li, Radev, Rafiei: Natural Language Interfaces to Databases, Springer Nature, 2023.
- Leskovec, Rajaraman, Ullman: Mining of Massive Datasets, 3rd ed., Cambridge UP, 2020.
- Relevant research papers (to be announced)