Project details
Description
A growing number of machine learning applications involve selecting subsets of data. Examples include selecting a small subset of a much larger dataset to label (to save labeling costs) or to train on (to reduce computational costs), and selecting a summary of a video or photo collection to make it easier for a person to view. Submodularity is a natural fit for these problems because it models properties such as diversity, representation, and coverage. In this project, the PIs will study a rich class of submodular information measures that model not only diversity, representation, and coverage but also constructs such as relevance and irrelevance to certain target concepts. One application is selecting a data summary that meets user specifications, e.g., a summary relevant to a given query or one subject to a privacy constraint (a photo summary relevant to a specific person, or one that avoids certain personal information). Another application is interactively selecting data samples to label in the presence of rare classes or while avoiding outliers (e.g., cancerous images as rare classes in medical imaging tasks). Advances in this field can have implications in many areas, including data summarization, reducing labeling effort (in tasks like medical imaging), and reducing the carbon footprint of training deep learning models on massive datasets.
The underlying mathematical model proposed in this project is a rich class of functions called "submodular information measures", which includes submodular mutual information, submodular conditional gain, submodular multi-set mutual information, directed submodular mutual information, and combinatorial independence. Specifically, the PIs will investigate and develop: (1) rich theoretical properties and instantiations of these submodular information measures; (2) optimization algorithms, approximation bounds, and hardness results for the associated optimization problems; (3) applications of the submodular information measures in data summarization, data subset selection, active learning, clustering, and diversified partitioning. While pursuing these activities, the PIs will involve undergraduate and under-represented high-school students in this research to inspire them to pursue careers in AI/ML and other STEM-related fields.
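To make two of the measures named above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the PIs' implementation. The project text gives no formulas, so the sketch uses the definitions commonly found in the submodular optimization literature: submodular mutual information I_f(A; Q) = f(A) + f(Q) - f(A ∪ Q) and submodular conditional gain f(A | P) = f(A ∪ P) - f(P). The facility-location base function, the toy similarity matrix, the simple greedy selector, and all function names are assumptions chosen for the example.

```python
from typing import Callable, FrozenSet, List, Sequence

SetFn = Callable[[FrozenSet[int]], float]


def facility_location(sim: Sequence[Sequence[float]]) -> SetFn:
    """Return f(A) = sum_i max_{j in A} sim[i][j] (illustrative base function)."""
    def f(subset: FrozenSet[int]) -> float:
        if not subset:
            return 0.0
        return float(sum(max(row[j] for j in subset) for row in sim))
    return f


def mutual_information(f: SetFn, a: FrozenSet[int], q: FrozenSet[int]) -> float:
    """Assumed definition: I_f(A; Q) = f(A) + f(Q) - f(A ∪ Q)."""
    return f(a) + f(q) - f(a | q)


def conditional_gain(f: SetFn, a: FrozenSet[int], p: FrozenSet[int]) -> float:
    """Assumed definition: f(A | P) = f(A ∪ P) - f(P)."""
    return f(a | p) - f(p)


def greedy(objective: SetFn, candidates: Sequence[int], budget: int) -> List[int]:
    """Plain greedy selection: repeatedly add the item with the best objective value."""
    selected: List[int] = []
    current: FrozenSet[int] = frozenset()
    for _ in range(budget):
        remaining = [c for c in candidates if c not in current]
        if not remaining:
            break
        best = max(remaining, key=lambda c: objective(current | {c}))
        selected.append(best)
        current = current | {best}
    return selected


# Toy data (hypothetical): items 0-1 form one cluster, items 2-3 another,
# and item 4 plays the role of a query/private item that resembles items 2-3.
sim = [
    [1.0, 0.9, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8, 0.9],
    [0.1, 0.1, 0.8, 1.0, 0.7],
    [0.1, 0.1, 0.9, 0.7, 1.0],
]
f = facility_location(sim)
query = frozenset({4})
candidates = [0, 1, 2, 3]

# Query-relevant pick: maximize mutual information with the query set.
targeted = greedy(lambda a: mutual_information(f, a, query), candidates, 1)
# Query-avoiding pick: maximize conditional gain given the same set.
avoiding = greedy(lambda a: conditional_gain(f, a, query), candidates, 1)
print(targeted, avoiding)
```

With this toy matrix, the mutual-information objective favors an item from the cluster that resembles item 4, while the conditional-gain objective favors an item from the other cluster. That mirrors the "relevant to a query" and "avoids certain information" use cases described in the abstract.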
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Status | Finished |
---|---|
Start date/End date | 9/15/21 → 8/31/24 |
Funding
- National Science Foundation: $1,068,261.00
Keywords
- Artificial Intelligence
- Computer Science (all)