module util::Sampling
Utilities to randomly select smaller datasets from larger datasets
Usage
import util::Sampling;
Dependencies
import util::Math;
import Map;
import List;
import Set;
Description
Sampling is important when the analysis algorithms do not scale to the size of the original corpus, or when you need to train an analysis on a representative set without overfitting on the entire corpus. These sampling functions all assume that a uniformly random selection is required.
function sample
Reduce the size of a set by selecting a uniformly distributed sample.
set[&T] sample(set[&T] corpus, int target)
A uniform subset is computed by iterating over the set and keeping each element
with a probability of 1/(size(corpus) / target). This rapidly generates a new set
whose expected size is target; the actual size varies from run to run and is usually a little smaller or larger.
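The selection step can be sketched as follows. This is a minimal illustration and not the library's implementation: `sketchSample` is a hypothetical name, and the sketch assumes the intent is to keep each element independently with probability target/size(corpus), which gives an expected result size of target.

```rascal
import util::Math;
import Set;

// Hypothetical sketch, not the code behind util::Sampling::sample.
// arbInt(n) returns a uniformly random int in [0, n), so the condition
// holds with probability target / size(corpus) for every element.
set[&T] sketchSample(set[&T] corpus, int target)
  = {e | e <- corpus, arbInt(size(corpus)) < target};
```

Because every element is decided independently, the result is always a subset of the corpus, but its size is only equal to target on average.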
Examples
rascal>import util::Sampling;
ok
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"a","b","e","f","j","k"}
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"a","e"}
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"a","b","h"}
function sample
Reduce the length of a list by selecting a uniformly distributed sample.
list[&T] sample(list[&T] corpus, int target)
The random selection of elements does not change their initial order in the list.
A uniform sublist is computed by iterating over the list and keeping each element
with a probability of 1/(size(corpus) / target). This rapidly generates a new list
whose expected length is target; the actual length varies from run to run and is usually a little smaller or larger.
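To get a feeling for how far the result length drifts from target, the sampling can be repeated and the lengths averaged. The helper below is a hypothetical sketch (the name `averageSampleSize` and the `runs` parameter are not part of the library); it only relies on the `sample` signature documented above and assumes `runs` is greater than zero.

```rascal
import util::Sampling;
import util::Math;
import List;

// Hypothetical helper: mean length of `runs` independent samples of `corpus`.
// Assumes runs > 0, otherwise the division fails.
real averageSampleSize(list[&T] corpus, int target, int runs) {
  // Sum the lengths of `runs` samples with a reducer.
  int total = (0 | it + size(sample(corpus, target)) | _ <- [0..runs]);
  return toReal(total) / toReal(runs);
}
```

For example, `averageSampleSize([1..1000], 30, 100)` should come out near 30, even though individual samples are shorter or longer.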
Examples
rascal>import util::Sampling;
ok
rascal>sample([1..1000], 30)
list[int]: [6,18,61,76,82,147,195,196,236,284,286,350,353,493,496,508,511,523,627,669,670,672,692,755,767,780,799,836,845,869,903,968,999]
rascal>sample([1..1000], 30)
list[int]: [21,156,157,162,271,290,325,360,377,386,459,484,498,517,607,626,651,673,691,693,698,758,807,834,854,899,994]
rascal>sample([1..1000], 30)
list[int]: [10,16,35,63,67,78,98,115,144,156,163,237,285,308,327,334,349,353,364,425,447,476,535,623,654,667,722,764,769,770,771,787,788,823,907,908,948,951,971,986,992]
function sample
Reduce the size of a map by selecting a uniformly distributed sample.
map[&T,&U] sample(map[&T,&U] corpus, int target)
A uniform submap is computed by iterating over the map's keys and keeping each key
with a probability of 1/(size(corpus) / target). This rapidly generates a new map
whose expected size is target; the actual size varies from run to run and is usually a little smaller or larger.
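A possible usage sketch for maps is shown below. The names `isSubmap`, `demo`, and `wordCounts` are hypothetical and only serve the illustration. No output is shown because the selected keys differ from run to run; the check assumes that each selected key keeps its original value.

```rascal
import util::Sampling;

// Hypothetical check: every entry of `small` also occurs in `big`.
bool isSubmap(map[&K,&V] small, map[&K,&V] big)
  = (true | it && (k in big) && (big[k] == small[k]) | k <- small);

// Samples a small example map and verifies the result against the original.
bool demo() {
  map[str,int] wordCounts = ("a": 3, "b": 1, "c": 7, "d": 2, "e": 5, "f": 4);
  return isSubmap(sample(wordCounts, 3), wordCounts);
}
```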