December 7, 2011
11:00AM – 4164 Harold Frank Hall
Ambuj Singh (chair)
Amr El Abbadi
Title: Querying Patterns in High Dimensional Heterogeneous Datasets
Existing search applications have not only become an essential part of our lives, but are also at the core of a multi-billion dollar industry. This unprecedented success has fueled further research and development of search applications for new emerging domains and far more sophisticated queries. To efficiently answer a query, objects in a dataset are transformed into high dimensional feature vectors and indexed into a data structure. It is highly desirable that the search algorithms are generalizable, scalable, efficient, and accurate for dataset of any size and dimension. The state-of-the-art search techniques achieve these properties only for simpler queries on homogeneous datasets of low dimensions. It is a very challenging task to develop search algorithms for a high dimensional heterogeneous dataset that simultaneously achieve all these properties.
My goal is to develop algorithms for answering multi-point queries in high dimensional heterogeneous datasets. I show that with sophisticated data transformations, novel data structures, and a careful design of the search steps, it is possible to develop search algorithms that satisfactorily achieve all the desired properties.
I will start my talk with a discussion on the useful queries inspired by the emerging datasets. Next, I will highlight the inherent challenges in answering these queries both accurately and efficiently. Finally, I will discuss a novel approach, ProMiSH, for solving a very useful problem of the nearest keyword set search in high dimensional spatial-textual datasets. I show that ProMiSH outperforms state-of-the-art methods by multiple orders of magnitude.