Abstract
Simple random sampling is a widely accepted basis for estimation from a population. When data come as a stream, the total population size continuously grows and only one pass through the data is possible. Reservoir sampling is a method of maintaining a fixed size random sample from streaming data. Reservoir sampling without replacement has been extensively studied and several algorithms with sub-linear time complexity exist. Although reservoir sampling with replacement is previously mentioned by some authors, it has been studied very little and only linear algorithms exist. A with-replacement reservoir sampling algorithm of sub-linear time complexity is introduced. A thorough complexity analysis of several approaches to the with-replacement reservoir sampling problem is also provided.
Original language | English |
---|---|
Pages (from-to) | 750-762 |
Number of pages | 13 |
Journal | Computational Statistics and Data Analysis |
Volume | 52 |
Issue number | 2 |
DOIs | |
State | Published - Oct 15 2007 |
Funding
This work is performed as part of the Scientific Data Management Center ( http://sdmcenter.lbl.gov ) under the Department of Energy's Scientific Discovery through Advanced Computing program ( http://www.scidac.org ). The work of BHP is partially supported by the US Department of Homeland Security's Advanced Scientific Computing program. The work of GO was supported by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory (ORNL). This work was performed under the auspices of the US Department of Energy by Oak Ridge National Laboratory, which is managed by the University of Tennessee-Battelle, L.L.C. for the Department of Energy under contract DOE-AC05-00OR22725. This research used the resources of the Center for Computational Sciences at ORNL.
Keywords
- Data stream mining
- Random sampling with replacement
- Reservoir sampling