
module util::Sampling

rascal-0.34.0

Utilities to randomly select smaller datasets from larger datasets

Usage

import util::Sampling;

Dependencies

import util::Math;
import Map;
import List;
import Set;

Description

Sampling is important when the analysis algorithms do not scale to the size of the original corpus, or when you need to train an analysis on a representative set without overfitting on the entire corpus. These sampling functions all assume that a uniformly random selection is required.

function sample

Reduce the size of a set by selecting a uniformly distributed sample.

set[&T] sample(set[&T] corpus, int target)

A uniform subset is computed by iterating over the set and keeping each element with a probability of 1/(size(corpus) / target), which is roughly target/size(corpus). This quickly generates a new set whose expected size is approximately target, although any single result will most likely be a little smaller or larger.
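The selection can be sketched as a set comprehension. This is a minimal illustration, not the library's source: the name sketchSample is hypothetical, and it reads the description as "each element survives with probability 1/(size(corpus) / target)", which is what yields an expected size near target. It also assumes that the division is Rascal's integer division on int values and that a factor of 0 (target larger than the corpus) should keep everything.

```rascal
import util::Math; // arbInt
import Set;        // size

// Hypothetical helper for illustration; not the actual util::Sampling code.
// Assumes target > 0 (a target of 0 would divide by zero).
set[&T] sketchSample(set[&T] corpus, int target) {
  // Integer division: roughly how many corpus elements exist per requested element.
  int factor = size(corpus) / target;

  // Assumption: when target exceeds the corpus size, keep the whole corpus.
  if (factor == 0) {
    return corpus;
  }

  // arbInt(factor) is uniform over 0 .. factor-1, so the test succeeds
  // with probability 1/factor, independently for every element.
  return {e | e <- corpus, arbInt(factor) == 0};
}
```

Under these assumptions a corpus of 10 elements with target 4 gives a factor of 2, so each element survives with probability 1/2 and the expected result size is 5 rather than exactly 4.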

Examples

rascal>import util::Sampling;
ok
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"a","b","c","e","g","i","j"}
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"a","c","e","f","g","h"}
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"a","e","g","i","j"}

function sample

Reduce the length of a list by selecting a uniformly distributed sample.

list[&T] sample(list[&T] corpus, int target)

The random selection of elements does not change their relative order in the list. A uniform sublist is computed by iterating over the list and keeping each element with a probability of 1/(size(corpus) / target), which is roughly target/size(corpus). This quickly generates a new list whose expected length is approximately target, although any single result will most likely be a little shorter or longer.
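The order-preservation property can be checked with a small helper. This is only a sketch: keepsOrder is a hypothetical name, and the check relies on the corpus being sorted, so that any sublist that keeps the original relative order must itself be sorted.

```rascal
import util::Sampling; // sample
import List;           // sort

// Hypothetical check for illustration; not part of util::Sampling.
bool keepsOrder(int target) {
  // [1..1000] is the ascending list 1, 2, ..., 999 (upper bound excluded).
  list[int] corpus = [1..1000];
  list[int] picked = sample(corpus, target);

  // A sample taken from a sorted list without reordering is still sorted.
  return picked == sort(picked);
}
```

The helper draws a fresh random sample on every call, so it tests the property on one sample at a time rather than proving it in general.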

Examples

rascal>import util::Sampling;
ok
rascal>sample([1..1000], 30)
list[int]: [21,44,68,107,112,152,171,175,256,261,262,282,330,367,396,415,431,435,459,461,520,587,590,620,621,656,660,663,706,709,782,827,898,915,953,978]
rascal>sample([1..1000], 30)
list[int]: [2,17,53,61,120,126,135,148,212,264,278,344,360,383,410,414,432,446,455,467,492,502,507,515,544,553,622,646,774,786,822,849,854,857,905,908,941,943,946,977]
rascal>sample([1..1000], 30)
list[int]: [13,16,21,30,71,105,141,221,289,292,299,311,376,396,424,429,554,583,692,751,778,796,803,805,848,887,896,905,942,959]

function sample

Reduce the size of a map by selecting a uniformly distributed sample.

map[&T,&U] sample(map[&T,&U] corpus, int target)

A uniform submap is computed by iterating over the map's keys and keeping each key, together with its value, with a probability of 1/(size(corpus) / target), which is roughly target/size(corpus). This quickly generates a new map whose expected size is approximately target, although any single result will most likely be a little smaller or larger.
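For maps, the same idea can be sketched with a map comprehension over the keys. As before, this is an illustration under assumptions and not the library's source: sketchSampleMap is a hypothetical name, each key is taken to survive with probability 1/(size(corpus) / target), the division is assumed to be integer division, and a factor of 0 is assumed to keep the whole map.

```rascal
import util::Math; // arbInt
import Map;        // size

// Hypothetical helper for illustration; not the actual util::Sampling code.
// Assumes target > 0 (a target of 0 would divide by zero).
map[&K, &V] sketchSampleMap(map[&K, &V] corpus, int target) {
  int factor = size(corpus) / target;

  // Assumption: when target exceeds the number of keys, keep the whole map.
  if (factor == 0) {
    return corpus;
  }

  // Enumerating a map with <- yields its keys; every surviving key
  // is copied into the result with its original value.
  return (k : corpus[k] | k <- corpus, arbInt(factor) == 0);
}
```

Because the selection is made per key, every entry in the result maps to the same value as in the original map.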