Sampling streaming data with replacement

Byung Hoon Park, George Ostrouchov, Nagiza F. Samatova

Research output: Contribution to journalArticlepeer-review

12 Scopus citations

Abstract

Simple random sampling is a widely accepted basis for estimation from a population. When data come as a stream, the total population size continuously grows and only one pass through the data is possible. Reservoir sampling is a method of maintaining a fixed size random sample from streaming data. Reservoir sampling without replacement has been extensively studied and several algorithms with sub-linear time complexity exist. Although reservoir sampling with replacement is previously mentioned by some authors, it has been studied very little and only linear algorithms exist. A with-replacement reservoir sampling algorithm of sub-linear time complexity is introduced. A thorough complexity analysis of several approaches to the with-replacement reservoir sampling problem is also provided.

Original languageEnglish
Pages (from-to)750-762
Number of pages13
JournalComputational Statistics and Data Analysis
Volume52
Issue number2
DOIs
StatePublished - Oct 15 2007

Funding

This work is performed as part of the Scientific Data Management Center ( http://sdmcenter.lbl.gov ) under the Department of Energy's Scientific Discovery through Advanced Computing program ( http://www.scidac.org ). The work of BHP is partially supported by the US Department of Homeland Security's Advanced Scientific Computing program. The work of GO was supported by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory (ORNL). This work was performed under the auspices of the US Department of Energy by Oak Ridge National Laboratory, which is managed by the University of Tennessee-Battelle, L.L.C. for the Department of Energy under contract DOE-AC05-00OR22725. This research used the resources of the Center for Computational Sciences at ORNL.

Keywords

  • Data stream mining
  • Random sampling with replacement
  • Reservoir sampling

Fingerprint

Dive into the research topics of 'Sampling streaming data with replacement'. Together they form a unique fingerprint.

Cite this