Data is the currency of modern science and engineering. Data-related problems are growing in size, driving demand for scalable learning tools and efficient optimization algorithms. In recent years, the volume of large, complex data generated by mobile devices, social networking websites, and business and medical platforms has increased dramatically. Revealing patterns and drawing insights from this growing mass of data drives significant advances across science and engineering.

Existing data analytics tools face several challenges at this scale. First, the high dimensionality and large volume of the data make many algorithms computationally inefficient, so it is often infeasible to run analytic tasks on a standalone processor. Second, classical learning tools require data to be stored in a centralized location, but in modern applications data is often distributed over different locations in a network, leading to high communication costs. Finally, in the extreme case the data rate is so high that we must turn to streaming algorithms, in which learning is performed without revisiting past entries of the data. Because a streaming algorithm can never ask to see a data point again, this model quite naturally presents a major challenge.

I am working on novel approaches to reduce the computational, storage, and communication burden of dealing with large-scale data sets. I focus on revealing the underlying structure of such data through core data analysis and machine learning tasks, such as principal component analysis and clustering, that are used across quantitative disciplines. The strategy is to use randomized techniques for dimensionality reduction and sampling, motivated by the classical Johnson-Lindenstrauss lemma. The idea behind this approach is to build a simple low-dimensional representation, or sketch, of the data that preserves key properties of the original data. Learning is then performed on the lower-dimensional sketches instead of the original data set, yielding substantial savings in memory and computation time.
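The sketching idea can be illustrated with a minimal Gaussian random projection in the spirit of the Johnson-Lindenstrauss lemma. This is a generic sketch, not the author's specific method: the dimensions `n`, `d`, `k` and the synthetic data are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n points in d dimensions (illustrative sizes only).
n, d, k = 100, 1000, 200
X = rng.standard_normal((n, d))

# JL-style sketch: a random Gaussian projection into k dimensions.
# Entries are i.i.d. N(0, 1/k), so squared distances are preserved
# in expectation, and concentrate when k = O(log n / eps^2).
R = rng.standard_normal((d, k)) / np.sqrt(k)
X_sketch = X @ R  # downstream learning runs on this instead of X

# Check the distortion of one pairwise distance.
orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(X_sketch[0] - X_sketch[1])
print(proj / orig)  # ratio close to 1
```

Any learning task that depends only on pairwise distances or inner products (e.g. clustering) can then operate on `X_sketch`, an n-by-k matrix, rather than the full n-by-d data set.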