With the growing scale and complexity of data sets across scientific disciplines, traditional data analysis methods are no longer practical for extracting meaningful information and patterns, for several reasons. Massive high-dimensional data sets arising in applications such as computer vision, social network analysis, and computational biology are too large to fit in the main memory of a single machine and must reside in slower external memory. These data sets may also be corrupted by noise, outliers, and missing observations, forcing us to reconsider existing model assumptions. Moreover, in time-sensitive applications, e.g., analyzing news articles or medical records, data samples often arrive sequentially in a streaming fashion, so only a small fraction of the input data, rather than the entire data set, can be stored and processed in a timely manner. Thus, the massive sample size and high dimensionality of modern data sets pose unique computational and statistical challenges for data analysis techniques.

My research revolves around identifying fundamental tradeoffs among memory, computational, and statistical efficiency in analyzing modern data sets, with the goal of developing practical data analysis methods. I use tools from computer science, applied mathematics, and statistics to provide mathematical guarantees on the quality of learning algorithms under limited resources, i.e., memory and time. The rigorous characterization of these tradeoffs is then used to design efficient methods that learn the underlying structure of modern data sets in the form of subspaces, clusters, and manifolds, with a variety of applications across scientific disciplines. In particular, I work on randomized dimension reduction techniques, such as random sampling and random projection, that embed high-dimensional data into a lower-dimensional space and produce sketches that preserve key statistical properties in the process. Data analysis methods are then applied directly to the sketches, rather than the original data, achieving substantial savings in memory and reductions in computational cost. The outcome of my research is a solid foundation for modern data analysis methods, together with their application to a wide range of domains via multidisciplinary collaborations.
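The random projection idea above can be illustrated with a minimal sketch. The snippet below is an assumed, generic example (the dimensions, variable names, and Gaussian construction are illustrative, not taken from the text): a data matrix is multiplied by a random Gaussian matrix, and pairwise distances in the sketch approximate those in the original data.

```python
import numpy as np

# Illustrative example of a Gaussian random projection sketch.
# All dimensions below are hypothetical choices for demonstration.
rng = np.random.default_rng(0)
n, d, k = 1000, 500, 50            # samples, ambient dimension, sketch dimension
X = rng.standard_normal((n, d))    # stand-in for a high-dimensional data set

# Random projection matrix scaled by 1/sqrt(k) so that squared norms
# (and hence pairwise distances) are preserved in expectation.
P = rng.standard_normal((d, k)) / np.sqrt(k)
X_sketch = X @ P                   # n x k sketch; analysis runs on this, not X

# Check the distortion of one pairwise distance.
orig = np.linalg.norm(X[0] - X[1])
sketched = np.linalg.norm(X_sketch[0] - X_sketch[1])
print(X_sketch.shape, sketched / orig)
```

Downstream methods such as clustering or subspace estimation then operate on the much smaller `X_sketch`, trading a controlled amount of distortion for large memory and runtime savings.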