Project Details
Description
Machine-learning tools have become ubiquitous in modern information systems. The data inputs used by these tools often originate from relational databases. The data outputs generated by those tools are often stored in databases, where they can be used for subsequent data analysis. Typically, however, the learning process itself is performed outside the database system. This project investigates the opportunity for performing more of the machine learning work within the database itself, avoiding expensive (and often redundant) data export and import. In partnership with researchers from Relational-AI and Microsoft, the Columbia University team will design and build two interacting open-source systems named MARQUE and ZORK. These systems will make data analysis more efficient and effective for database-resident information. Improved efficiency will lead to faster, more cost-effective machine learning, and executing ML within the DBMS will reduce operational complexity and benefit from DBMS features such as scalability, access control, and data management. Ultimately, this work will broaden the adoption of machine learning technologies in a wide range of data-intensive disciplines.
MARQUE will be a database management system that supports machine learning primitives such as linear algebra operations within the context of a query processing engine. The system will efficiently compile SQL queries using embedded machine learning models within the database, combining state-of-the-art query processing techniques with highly engineered linear algebra algorithms. MARQUE will allow components of the machine-learning pipeline itself to be formulated as in-database operations, avoiding unnecessary data copying. Conventional SQL analytic queries that can be reformulated using extensions of operators like matrix multiplication can be optimized to use efficient execution plans involving specialized algorithms for such operators.
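As a rough illustration of the reformulation mentioned above, a join followed by a grouped sum of products computes the same values as a matrix product when each matrix is stored as (row, column, value) tuples. The sketch below uses Python's built-in sqlite3 module purely as a stand-in engine; the tables `a` and `b`, the sample matrices, and the helper names are assumptions made for this example and are not part of MARQUE.

```python
import sqlite3

# Hypothetical sample matrices: A is 2x3, B is 3x2.
A = [[1, 2, 0],
     [0, 1, 3]]
B = [[4, 0],
     [1, 5],
     [2, 1]]


def to_triples(matrix):
    """Encode a dense matrix as (row, col, value) tuples, skipping zeros."""
    return [(i, j, v)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row)
            if v != 0]


def matmul(x, y):
    """Plain dense matrix multiplication, used here as the reference result."""
    return [[sum(x[i][j] * y[j][k] for j in range(len(y)))
             for k in range(len(y[0]))]
            for i in range(len(x))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (i INTEGER, j INTEGER, v REAL)")
conn.execute("CREATE TABLE b (j INTEGER, k INTEGER, v REAL)")
conn.executemany("INSERT INTO a VALUES (?, ?, ?)", to_triples(A))
conn.executemany("INSERT INTO b VALUES (?, ?, ?)", to_triples(B))

# A conventional analytic query: join on the shared index j, then sum the
# products per output cell (i, k). Cells with no matching tuples are absent.
sql_product = conn.execute("""
    SELECT a.i, b.k, SUM(a.v * b.v)
    FROM a JOIN b ON a.j = b.j
    GROUP BY a.i, b.k
    ORDER BY a.i, b.k
""").fetchall()

dense_product = matmul(A, B)

# Every cell returned by the query matches the dense matrix product.
for i, k, value in sql_product:
    assert value == dense_product[i][k]
```

An engine that recognizes this query shape could, in principle, evaluate it with a specialized matrix-multiplication routine instead of a generic join and aggregation.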
To further support in-database machine learning, the project investigators will build ZORK, a system to support machine learning at scale that will make use of the infrastructure provided by MARQUE. ZORK will scale to very large datasets by processing factorized representations of the data rather than explicitly materializing large joins. This project will develop new query processing techniques for queries involving both conventional relational operators and generalized linear algebra operators. Tight integration will facilitate query optimization within and between operators. Using this system, a range of machine learning techniques will be developed that operate entirely within the database management system, avoiding data export and simplifying concerns such as data privacy administration.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
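The idea of working on factorized data instead of a materialized join, mentioned in the description of ZORK, can be sketched with a single aggregate. To compute the sum of x * y over the join of two relations on a shared key, it is enough to sum each relation per key and multiply the per-key sums, because the sum of products distributes over the join. The relations, column meanings, and function names below are hypothetical, and this is a simplified illustration of the general principle rather than ZORK's actual algorithm.

```python
from collections import defaultdict

# Hypothetical relations, each a list of (key, value) rows.
orders = [("c1", 10.0), ("c1", 20.0), ("c2", 5.0)]
visits = [("c1", 2.0), ("c1", 3.0), ("c2", 4.0), ("c3", 7.0)]


def sum_product_materialized(r, s):
    """Build the full join first, then aggregate: cost grows with join size."""
    joined = [(x, y) for kr, x in r for ks, y in s if kr == ks]
    return sum(x * y for x, y in joined)


def sum_product_factorized(r, s):
    """Aggregate each relation per key, then combine: the join is never built."""
    r_sums = defaultdict(float)
    s_sums = defaultdict(float)
    for key, x in r:
        r_sums[key] += x
    for key, y in s:
        s_sums[key] += y
    # Only keys present in both relations contribute to the join.
    return sum(r_sums[key] * s_sums[key] for key in r_sums if key in s_sums)


materialized = sum_product_materialized(orders, visits)
factorized = sum_product_factorized(orders, visits)
assert materialized == factorized
```

The factorized version makes one pass over each input, while the materialized version does work proportional to the number of joined rows, which can be far larger than either input.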
Status | Active |
---|---|
Effective start/end date | 7/1/23 → 6/30/27 |
ASJC Scopus Subject Areas
- Artificial Intelligence
- Algebra and Number Theory
- Computer Networks and Communications
- Engineering (all)
- Computer Science (all)