Reservoir-based random sampling with replacement from data stream

Byung Hoon Park, George Ostrouchov, Nagiza F. Samatova, Al Geist

Research output: Contribution to conference › Paper › peer-review

15 Scopus citations

Abstract

Random sampling is a widely accepted basis for estimation from large data sets that outstrip available computer memory. When the data come as a stream, the total size is potentially infinite and usually only one pass through the data is possible. Reservoir sampling is a method of maintaining a fixed-size random sample from streaming data. All reservoir schemes introduced in the past perform random sampling without replacement; no duplicates are allowed in the sample. This paper introduces a new method for reservoir sampling with replacement. We first prove that the proposed method indeed maintains a random sample with replacement at any given time. We then introduce a refined version that significantly speeds up the overall sampling procedure.
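The abstract outlines the idea but not the algorithm itself. As a rough illustration only (a minimal sketch, not the authors' method or its refined, faster version), a fixed-size sample with replacement can be maintained by running k independent size-1 reservoirs: when the t-th stream item arrives, each of the k slots independently replaces its contents with that item with probability 1/t, so at any time every slot holds a uniform draw from the items seen so far, and the k slots together form a sample with replacement. The Python sketch below implements just this basic scheme; the class and method names are invented for the example.

import random

class ReservoirWithReplacement:
    """Maintain a fixed-size random sample *with replacement* from a stream.

    Illustrative sketch only: each of the k slots is an independent size-1
    reservoir, so after t items every slot holds a uniform draw from the
    first t items. The paper's refined, accelerated variant is not shown.
    """

    def __init__(self, k, rng=None):
        self.k = k                      # sample size; slots may hold duplicates
        self.sample = [None] * k        # the current with-replacement sample
        self.t = 0                      # number of stream items seen so far
        self.rng = rng or random.Random()

    def feed(self, item):
        """Process the next stream item."""
        self.t += 1
        p = 1.0 / self.t
        for i in range(self.k):
            # Slot i independently keeps `item` with probability 1/t.
            if self.rng.random() < p:
                self.sample[i] = item


# Usage: stream the integers 0..9999 and inspect a sample of size 5.
if __name__ == "__main__":
    res = ReservoirWithReplacement(k=5, rng=random.Random(42))
    for x in range(10_000):
        res.feed(x)
    print(res.sample)   # five i.i.d. uniform draws from the items seen so far

Because every slot is updated independently, duplicates can appear in the sample, which is exactly what sampling with replacement requires. The cost is k random-number draws per stream item; reducing that per-item work is the kind of speedup the paper's refined version targets.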

Original language: English
Pages: 492-496
Number of pages: 5
DOIs
State: Published - 2004
Event: Proceedings of the Fourth SIAM International Conference on Data Mining - Lake Buena Vista, FL, United States
Duration: Apr 22, 2004 – Apr 24, 2004

Conference

Conference: Proceedings of the Fourth SIAM International Conference on Data Mining
Country/Territory: United States
City: Lake Buena Vista, FL
Period: 04/22/04 – 04/24/04

Keywords

  • Data stream mining
  • Probabilistic method
  • Random sampling
  • Reservoir sampling
