Abstract
Random sampling is a widely accepted basis for estimation from large data sets that outstrip available computer memory. When the data comes as a stream, its total size is potentially infinite and usually only one pass through the data is possible. Reservoir sampling is a method of maintaining a fixed size random sample from streaming data. All reservoir schemes that have been introduced in the past are random sampling without replacement; no duplicates are allowed in a sample. This paper introduces a new method for reservoir sampling with replacement. We first prove that the proposed method indeed maintains a random sample with replacement at any given time. Then we introduce a refined version that significantly speeds up the overall sampling procedure.
Original language | English |
---|---|
Pages | 492-496 |
Number of pages | 5 |
DOIs | |
State | Published - 2004 |
Event | Proceedings of the Fourth SIAM International Conference on Data Mining - Lake Buena Vista, FL, United States Duration: Apr 22 2004 → Apr 24 2004 |
Conference
Conference | Proceedings of the Fourth SIAM International Conference on Data Mining |
---|---|
Country/Territory | United States |
City | Lake Buena Vista, FL |
Period | 04/22/04 → 04/24/04 |
Keywords
- Data stream mining
- Probabilistic method
- Random sampling
- Reservoir sampling