Abstract
Random sampling is a widely accepted basis for estimation from large data sets that outstrip available computer memory. When the data comes as a stream, its total size is potentially infinite and usually only one pass through the data is possible. Reservoir sampling is a method of maintaining a fixed size random sample from streaming data. All reservoir schemes that have been introduced in the past are random sampling without replacement; no duplicates are allowed in a sample. This paper introduces a new method for reservoir sampling with replacement. We first prove that the proposed method indeed maintains a random sample with replacement at any given time. Then we introduce a refined version that significantly speeds up the overall sampling procedure.
| Original language | English |
|---|---|
| Pages | 492-496 |
| Number of pages | 5 |
| DOIs | |
| State | Published - 2004 |
| Event | Proceedings of the Fourth SIAM International Conference on Data Mining - Lake Buena Vista, FL, United States Duration: Apr 22 2004 → Apr 24 2004 |
Conference
| Conference | Proceedings of the Fourth SIAM International Conference on Data Mining |
|---|---|
| Country/Territory | United States |
| City | Lake Buena Vista, FL |
| Period | 04/22/04 → 04/24/04 |
Keywords
- Data stream mining
- Probabilistic method
- Random sampling
- Reservoir sampling