Statistical and Computational Tools for Analyzing High-Dimensional Heterogeneous Data

Wang, Kaizheng (PI)

Columbia University

Project

Description

Modern technologies generate tremendous volumes of data in diverse forms. High-throughput data inevitably come with great heterogeneity and an enormous amount of noise. For instance, a large-scale genetic study typically involves people with widely varying attributes; a social network usually consists of multiple hidden communities with denser internal connections than external ones. While the raw features have high ambient dimension (for example, thousands of genes), the intrinsic structures often exhibit low complexity (for example, the latitude and longitude of an individual's geographic location). Precise extraction of the latent structure paves the way for solving downstream tasks. Faced with significant challenges in statistics and computation, this project aims to develop efficient methodologies for estimating and inferring latent structures from heterogeneous data. This project will yield cutting-edge tools for scientific study, open-source software for easy implementation, and new mathematical theorems for theoretical analysis. The project will also provide numerous opportunities for statistical education and research training.

The project is structured into three parts. In the first part, the goal is to develop a new, flexible methodology for clustering high-dimensional data. This part aims to produce new algorithms that can identify non-spherical and even non-convex clusters. An in-depth analysis of mixture models brings theoretical insights, including tight finite-sample statistical error bounds and finite-iteration convergence guarantees for computation. In the second part, the goal is to study heterogeneous relational data that encode the information of individual objects in their pairwise relations. This part yields reliable methods for estimating and testing latent structures in the realistic scenario where the partially observed data may not be sampled uniformly at random. Finally, the third part focuses on the joint analysis of multiple related datasets, such as social networks with high-dimensional personal attributes. The tools developed in the first two parts of the project will constitute fundamental building blocks for addressing the research goals of this last thrust. The research findings will provide novel, efficient data integration strategies for enhanced statistical accuracy. The proposed research initiatives include dissemination of the new methods and algorithms in the form of publicly available software, and an active agenda for strengthening interdisciplinary research training and enhancing diversity in the statistical sciences.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Status	Active
Start date/End date	8/1/22 → 7/31/25

Funding

National Science Foundation

Keywords

Statistics and Probability
Mathematics (all)
Physics and Astronomy (all)

Access the project

https://www.nsf.gov/awardsearch/showAward?AWD_ID=2210907
