Project Details
Description
PROJECT ABSTRACT
With the surge of large-scale genomics data, the breadth and depth of available genomics datasets have grown
immensely, and the privacy of individuals has become increasingly important in genomic data science.
Detailed genetic and environmental characterization of diseases and conditions relies on the large-scale
mining of genotype-phenotype relationships; hence, there is a great desire to share data as broadly as possible.
The recent change in NIH policy on sharing genomic summary results is a great step towards making the data
available to a broader range of researchers. However, privacy studies on inferring study participation have not
kept pace with technological advances in genome sequencing. A key first step in reducing private
information leakage is to measure its extent, particularly under different scenarios. To
this end, we propose to derive information-theoretic measures for private information leakage in different
genomic data sharing scenarios, especially when the datasets are noisy and incomplete. We will also develop
various risk assessment tools. We will approach the privacy analysis under three aims. First, we will develop
statistical metrics that can be used to quantify the sensitive information leakage in different data sharing
scenarios, including conditions in which the genotype data are imperfect. We will systematically analyze
the risk of inferring a patient's study participation. Second, we will design a plausible privacy attack
through an experimental study, in which different technologies will be used to sequence genomes from trace
amounts of sample, such as touched objects or used glasses. This will allow us to study plausible scenarios of
surreptitious DNA testing and its effect on genomic data sharing. Third, we will develop risk assessment tools
for sharing genomic summary results. These tools will simulate hundreds of scenarios, informed by the
simulations in Aim 1 and the real-life privacy attacks in Aim 2, to quantify the risks before the data are released.
The tools will be implemented using cryptographic techniques to further reduce private information
leakage during the risk assessment step.
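As a rough illustration of what an information-theoretic leakage measure can look like, the sketch below sums the surprisal, in bits, of a set of observed genotypes. It is not the project's metric. It assumes independent biallelic SNPs in Hardy-Weinberg equilibrium, allele frequencies strictly between 0 and 1, and missing calls encoded as `None`. The function names `genotype_prob` and `leaked_bits` are hypothetical.

```python
import math


def genotype_prob(g, p):
    """Hardy-Weinberg probability of carrying g copies (0, 1, or 2)
    of an allele whose population frequency is p (assumed 0 < p < 1)."""
    if g == 0:
        return (1 - p) ** 2
    if g == 1:
        return 2 * p * (1 - p)
    if g == 2:
        return p ** 2
    raise ValueError("genotype must be 0, 1, or 2")


def leaked_bits(genotypes, freqs):
    """Sum of -log2 P(g) over observed SNPs.

    Assumes SNPs are independent (an illustrative simplification).
    Missing calls (None) are skipped, so they contribute no information.
    """
    total = 0.0
    for g, p in zip(genotypes, freqs):
        if g is None:
            continue
        total += -math.log2(genotype_prob(g, p))
    return total


# One heterozygous call at a SNP with allele frequency 0.5
print(leaked_bits([1], [0.5]))
```

Under these assumptions, a heterozygous call at allele frequency 0.5 contributes exactly 1 bit. Rarer genotypes contribute more, and missing calls lower the total, which is one simple way incompleteness reduces what an observer can learn.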
During the K99 phase, the aim of this project is to find the minimum amount of genotyping information required and
the maximum amount of noise tolerated for detecting a genome in a mixture, using simulations and wet-lab
experiments. To accomplish this research goal, the K99 phase will involve training in molecular biology,
genomics and privacy. This training will take place at Yale University in the Department of Molecular Biophysics
and Biochemistry, under the mentorship of Dr. Mark Gerstein (genomics and privacy) and Dr. Andrew Miranker
(molecular biology). Building on the K99 training, the R00 phase will simulate the experimental results to
increase the sample size, build privacy risk assessment tools from the experimental and simulated
results, and implement these tools using cryptographic
techniques.
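One published way to test whether a genome is present in a mixture is the distance statistic described by Homer et al. (2008). It is used here purely for illustration and is not confirmed as this project's method. The statistic compares how close a person's alleles are to the mixture versus a reference population. The function name `membership_statistic` and the input encoding (allele fractions 0.0, 0.5, 1.0, with `None` for missing calls) are assumptions.

```python
def membership_statistic(person, mixture, reference):
    """Mean over SNPs of |y - ref| - |y - mix|.

    person:    per-SNP allele fraction of the individual (0.0, 0.5, or 1.0),
               or None where the call is missing
    mixture:   per-SNP allele frequency observed in the mixture
    reference: per-SNP allele frequency in a reference population

    Positive values mean the individual is, on average, closer to the
    mixture than to the reference population.
    """
    diffs = [
        abs(y - r) - abs(y - m)
        for y, m, r in zip(person, mixture, reference)
        if y is not None  # skip missing genotype calls
    ]
    if not diffs:
        raise ValueError("no usable SNPs")
    return sum(diffs) / len(diffs)


# The mixture is shifted toward this person's alleles at both SNPs
print(membership_statistic([1.0, 0.0], [0.75, 0.25], [0.5, 0.5]))
```

In a simulation, one could reduce the number of SNPs or perturb the `person` and `mixture` values and watch where the statistic stops separating members from non-members. A real analysis would evaluate it over many SNPs with a significance test.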
Status | Finished |
---|---|
Effective start/end date | 5/9/22 → 4/30/24 |
Funding
- National Human Genome Research Institute: US$239,858.00
- National Human Genome Research Institute: US$249,000.00
ASJC Scopus Subject Areas
- Genetics
- Molecular Biology