Report ID
2005-23
Report Date
Abstract
We propose an integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream. The proposed technique is efficient and exact if the alphabet under consideration is small. In the more practical large alphabet case, our solution is space efficient and reports both top-k and frequent elements with tight guarantees on errors. For general data distributions, our top-k algorithm can return a set of k\' elements, where k\' is approximately k, which are guaranteed to be the top-k\' elements; and it uses minimal space for calculating frequent elements. For realistic Zipfian data, the space requirement of the proposed algorithm for the frequent elements problem decreases dramatically with the parameter of the distribution; and for top-k queries, the analysis ensures that only the top-k elements, in the correct order, are reported. The experiments, using real and synthetic datasets, show significant space reductions with no loss in accuracy. Having proved the effectiveness of the proposed approach, through both analysis and experiments, we extend it to be able to answer continuous queries about top-k and frequent elements. Although the problems of incremental reporting of top-k and frequent elements are useful in many applications, to the best of our knowledge, no solution has been previously proposed.
Document
2005-23.pdf402.13 KB