Report ID
2000-16
Report Authors
Christian A. Lang and Ambuj K. Singh
Report Date
Abstract
A large number of index structures for high-dimensional data have beenproposed previously. In order to tune and compare such index structures,it is vital to have efficient cost prediction techniques for thesestructures. Previous techniques either assume uniformity of the data orare not applicable to high-dimensional data. We propose the use ofsampling to predict the number of accessed index pages during a queryexecution. Sampling is independent of the dimensionality and preservesclusters which is important for representing skewed data. We present ageneral model for estimating the index page layout using sampling andshow how to compensate for errors. We then give an implementation ofour model under restricted memory assumptions and show that it performswell even under these constraints. Errors are minimal and the overallprediction time is up to two orders of magnitude below the time forbuilding and probing the full index without sampling.
Document
2000-16.ps527.87 KB