module util::Sampling

rascal-0.34.0

Utilities to randomly select smaller datasets from larger datasets

Usage

import util::Sampling;

Dependencies

import util::Math;
import Map;
import List;
import Set;

Description

Sampling is important when the analysis algorithms do not scale to the size of the original corpus, or when you need to train an analysis on a representative set without overfitting on the entire corpus. These sampling functions all assume that a uniformly random selection is required.

function sample

Reduce the size of a set by selecting a uniformly distributed sample.

set[&T] sample(set[&T] corpus, int target)

A uniform subset is computed by iterating over the set and keeping each element with a probability of 1/(size(corpus) / target); every other element is skipped. This quickly produces a new set whose expected size is about target, although the actual size will usually be a little smaller or larger.

Examples

rascal>import util::Sampling;
ok
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"a","b","c","f","g","i","k"}
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"c","g","h","i","k"}
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"a","b","c","e","f","g","h"}
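
The selection step can be sketched in Rascal as follows. This is a minimal illustration, not the library source: the module name SamplingSketch, the function name sampleSketch and the guard for small corpora are assumptions, and the sketch reads size(corpus) / target as Rascal integer division. Under that reading, a corpus of 10 elements with target 4 gives a factor of 2, so each element is kept with probability 1/2 and about 5 elements are expected, which would be in line with the result sizes shown above.

```rascal
module SamplingSketch

import util::Math; // arbInt
import Set;        // size

// Hypothetical re-implementation for illustration only.
// Assumes target > 0; a target of 0 would divide by zero.
set[&T] sampleSketch(set[&T] corpus, int target) {
  // Integer division: 10 elements and target 4 give factor 2.
  int factor = size(corpus) / target;
  if (factor <= 1) {
    // Fewer than 2 * target elements: keep everything (assumed guard).
    return corpus;
  }
  // arbInt(factor) is uniform over 0 .. factor - 1, so each element
  // is kept independently with probability 1/factor.
  return {e | e <- corpus, arbInt(factor) == 0};
}
```

Because every element is decided independently, the result size follows a binomial distribution around size(corpus) / factor rather than being fixed at target.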

function sample

Reduce the length of a list by selecting a uniformly distributed sample.

list[&T] sample(list[&T] corpus, int target)

The random selection of elements does not change their initial order in the list. A uniform sublist is computed by iterating over the list and keeping each element with a probability of 1/(size(corpus) / target); every other element is skipped. This quickly produces a new list whose expected length is about target, although the actual length will usually be a little smaller or larger.

Examples

rascal>import util::Sampling;
ok
rascal>sample([1..1000], 30)
list[int]: [53,91,108,132,147,194,226,233,236,280,286,287,292,296,302,484,506,563,581,605,609,616,620,648,715,755,764,773,787,864,865,875,890,929,938]
rascal>sample([1..1000], 30)
list[int]: [7,30,35,40,47,57,71,114,116,126,131,180,204,222,254,292,293,300,303,313,337,347,420,435,450,523,525,543,601,610,617,656,693,784,799,840,841,882,927]
rascal>sample([1..1000], 30)
list[int]: [37,59,62,79,142,204,207,293,308,329,334,356,361,367,372,409,451,510,527,568,686,689,868,915,940,949,951,985]
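
The order-preserving property can be checked with a small helper. This is a sketch under assumptions: the module name SamplingOrderCheck and the function sampleKeepsOrder are hypothetical and not part of util::Sampling. It samples an ascending list and compares the result with its sorted version.

```rascal
module SamplingOrderCheck

import util::Sampling; // sample
import List;           // sort

// Hypothetical helper: samples the ascending list [0..n] and reports
// whether the sampled elements are still in ascending order.
bool sampleKeepsOrder(int n, int target) {
  list[int] s = sample([0..n], target);
  return s == sort(s);
}
```

A call such as sampleKeepsOrder(1000, 30) should hold for every run, since elements are only dropped and never reordered.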

function sample

Reduce the size of a map by selecting a uniformly distributed sample.

map[&T,&U] sample(map[&T,&U] corpus, int target)

A uniform submap is computed by iterating over the map's keys and keeping each key, together with its value, with a probability of 1/(size(corpus) / target); every other key is skipped. This quickly produces a new map whose expected size is about target, although the actual size will usually be a little smaller or larger.
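
A possible usage of the map variant is sketched below. The module name SamplingMapExample, the variable linesPerFile and its contents are made-up illustration data, and the two helper functions are hypothetical. The check relies on Rascal's submap operator `<=` to confirm that sampled keys keep their original values.

```rascal
module SamplingMapExample

import util::Sampling; // sample

// Hypothetical corpus: file names mapped to line counts (made-up data).
map[str, int] linesPerFile = (
  "A.java": 120, "B.java": 45, "C.java": 300,
  "D.java": 12, "E.java": 87, "F.java": 230
);

// Draws a sample of roughly 2 entries; the actual size varies per call.
map[str, int] subset() = sample(linesPerFile, 2);

// Every sampled key keeps its original value, so the sample is a submap.
bool subsetIsSubmap() = subset() <= linesPerFile;
```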