Blind feature selection: Selecting features with privacy-preserving summary statistics
The analysis of sensitive personal data, such as genetic and medical records, poses substantial privacy risks. In recent years there has been growing interest in drawing conclusions from such data using only privacy-preserving summary statistics. An important problem in this setting is finding a set of features of bounded size that maximizes the explained variance of a response.
I will describe a method for such feature selection in high-dimensional linear models, given two sources of auxiliary information: (1) the covariance matrix of the features; (2) a set of annotations describing each feature. I will then present an application of this method to genetic datasets with millions of genetic variants, which shows that 50% of the explained variance of many complex human traits can be localized to fewer than 5% of the common variants in the human genome.
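
To make the selection problem concrete, here is a minimal sketch of how explained variance can be computed and greedily maximized from summary statistics alone. It is an illustration under stated assumptions, not the method presented in the talk: it uses plain greedy forward selection, ignores the feature annotations, and assumes that both the feature covariance matrix (`cov`) and the marginal feature-response covariances (`xy_cov`) are available. All function and variable names are hypothetical. For a subset S, the variance explained by the best linear predictor is c_S^T Σ_SS^{-1} c_S, which needs no individual-level data.

```python
import numpy as np

def explained_variance(cov, xy_cov, subset):
    """Variance of the response explained by the best linear fit on `subset`.

    cov    : (p, p) feature covariance matrix (summary statistic)
    xy_cov : (p,) covariance of each feature with the response (summary statistic)
    """
    idx = list(subset)
    if not idx:
        return 0.0
    c = xy_cov[idx]
    # c_S^T * inv(Sigma_SS) * c_S, computed with a linear solve instead of an inverse
    return float(c @ np.linalg.solve(cov[np.ix_(idx, idx)], c))

def greedy_select(cov, xy_cov, k):
    """Greedy forward selection (an illustrative heuristic, assumed here):
    repeatedly add the feature that yields the largest explained variance."""
    selected = []
    remaining = set(range(len(xy_cov)))
    for _ in range(k):
        best = max(
            remaining,
            key=lambda j: explained_variance(cov, xy_cov, selected + [j]),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: features 0 and 1 are strongly correlated, feature 2 is independent.
cov = np.array([[1.0, 0.9, 0.0],
                [0.9, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
xy_cov = np.array([0.5, 0.45, 0.3])
chosen = greedy_select(cov, xy_cov, k=2)
```

In this toy example, feature 1 has the second-largest marginal covariance with the response, but it is almost redundant given feature 0, so ranking by marginal association alone would pick a worse pair than a criterion that accounts for the feature covariance matrix.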