module Sampling
rascal-0.28.2
Usage
import util::Sampling;
Dependencies
import util::Math;
import Map;
import List;
import Set;
Synopsis
Utilities to randomly select smaller datasets from larger datasets.
Description
Sampling is important when the analysis algorithms do not scale to the size of the original corpus, or when you need to train an analysis on a representative set without overfitting on the entire corpus. These sampling functions all assume that a uniformly random selection is required.
function sample
Synopsis
Reduce the size of a set by selecting a uniformly distributed sample.
Description
A uniform subset is computed by iterating over the set and keeping each element with a probability of 1/(size(corpus) / target), skipping it otherwise. This quickly generates a new set whose expected size is target, although any single result will most probably be a little smaller or larger.
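The per-element selection described above can be sketched as follows. This is a minimal illustration and not necessarily how util::Sampling is implemented: the module name SamplingSketch and the function name sampleSketch are hypothetical, and the sketch assumes that target is greater than zero.

```rascal
module SamplingSketch

import util::Math; // arbInt
import Set;        // size

// Hypothetical helper, not part of util::Sampling. Assumes target > 0.
set[&T] sampleSketch(set[&T] corpus, int target) {
  // In Rascal, / on two ints is integer division, so for a corpus of
  // 10 elements and a target of 4 the factor is 2.
  int factor = size(corpus) / target;

  // arbInt(factor) draws a uniform int in [0, factor), so each element
  // is kept independently with probability 1/factor. When the corpus is
  // not larger than the target (factor <= 1), every element is kept.
  return { e | e <- corpus, factor <= 1 || arbInt(factor) == 0 };
}
```

Because every element is decided independently, the size of the result varies from run to run, which matches the differing result sizes in the examples.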
Examples
rascal>import util::Sampling;
ok
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"c","g"}
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"c","h","i"}
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"b","e","h","j","k"}
set[&T] sample(set[&T] corpus, int target)
list[&T] sample(list[&T] corpus, int target)
map[&T,&U] sample(map[&T,&U] corpus, int target)
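The list and map overloads are called in the same way as the set version. A brief usage sketch follows; the input values are made up for illustration, and no output is shown because the results differ between runs.

```rascal
rascal>import util::Sampling;
rascal>sample([1..101], 10)
rascal>sample((i : i * i | int i <- [1..101]), 10)
```

The first call passes a list of the integers 1 to 100 and the second a map from each of those integers to its square; per the signatures above, each call returns a collection of the same type as its input.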