Realistic quantification of potential privacy loss from genomic summary results

Gursoy, Gamze G (PI)

Columbia University Irving Medical Center

Project: Research project

Description

PROJECT ABSTRACT With the surge of large-scale genomic data, the breadth and depth of genomics datasets have grown immensely, and the privacy of individuals has become increasingly important in genomic data science. Detailed genetic and environmental characterization of diseases and conditions relies on the large-scale mining of genotype-phenotype relationships; hence, there is a great desire to share data as broadly as possible. The recent change in NIH policy on sharing genomic summary results is a great step towards making the data available to a broader range of researchers. However, privacy studies on inferring study participation are outdated relative to the pace of technological advancement in genome sequencing. A key first step in reducing private information leakage is to measure the amount of leakage, particularly under different scenarios. To this end, we propose to derive information-theoretic measures of private information leakage in different genomic data sharing scenarios, especially when the datasets are noisy and incomplete. We will also develop various risk assessment tools. We will approach the privacy analysis under three aims. First, we will develop statistical metrics that quantify sensitive information leakage in different data sharing scenarios, including conditions in which the genotype data are imperfect. We will systematically analyze the risk of inferring a patient's participation in a study. Second, we will design a plausible privacy attack through an experimental study, in which different technologies will be used to sequence genomes from trace amounts of sample, such as touched objects or used glasses. This will allow us to study plausible scenarios of surreptitious DNA testing and their effect on genomic data sharing. Third, we will develop risk assessment tools for sharing genomic summary results. These tools will simulate hundreds of scenarios learned through the simulations in aim 1 and the real-life privacy attacks in aim 2 to quantify the risks before the data are released. The tools will be implemented using cryptographic techniques to further reduce private information leakage during the risk assessment step.

During the K99 phase, the aim of this project is to find the minimum amount of genotyping information required and the maximum amount of noise tolerated for detecting a genome in a mixture, using simulations and wet-lab experiments. To accomplish this research goal, the K99 phase will involve training in molecular biology, genomics, and privacy. This training will take place at Yale University in the Department of Molecular Biophysics and Biochemistry, under the mentorship of Dr. Mark Gerstein (genomics and privacy) and Dr. Andrew Miranker (molecular biology). Building on the K99 training, the goal of the R00 phase will be to simulate the results of the experimental work in order to increase the sample size, to build privacy risk assessment tools from the results of the experiments and simulations, and to implement those tools using cryptographic techniques.

Status	Finished
Effective start/end date	5/9/22 → 4/30/24

Funding

National Human Genome Research Institute: US$239,858.00
National Human Genome Research Institute: US$249,000.00

ASJC Scopus Subject Areas

Genetics
Molecular Biology

Access Project

https://projectreporter.nih.gov/project_info_details.cfm?aid=10616768
