# module util::Sampling

rascal-0.34.0

Utilities to randomly select smaller datasets from larger datasets

#### Usage

``import util::Sampling;``

#### Dependencies

``
import util::Math;
import Map;
import List;
import Set;
``

#### Description

Sampling is important when the analysis algorithms do not scale to the size of the original corpus, or when you need to train an analysis on a representative set without overfitting on the entire corpus. These sampling functions all assume that a uniformly random selection is required.

## function sample

Reduce the size of a set by selecting a uniformly distributed sample.

``set[&T] sample(set[&T] corpus, int target)``

A uniform subset is computed by iterating over the set and keeping each element with a probability of `1/(size(corpus) / target)`; every other element is skipped. This quickly generates a new set whose expected size is `target`, although any individual result will most likely be a little smaller or larger.
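The selection step can be sketched as below. This is a minimal illustration, not the library's source. The function name `sketchSample` is hypothetical, and the sketch assumes that `target` is positive and that `size(corpus) / target` is an integer division. Each element is retained with probability `1/factor`, which is what brings the expected result size close to `target`.

```rascal
module SamplingSketch

import util::Math; // arbInt
import Set;        // size

// Hypothetical helper for illustration; not the implementation in util::Sampling.
set[&T] sketchSample(set[&T] corpus, int target) {
  // Assumption: target > 0. Integer division, e.g. 10 / 4 == 2.
  int factor = size(corpus) / target;

  if (factor <= 1) {
    // Fewer than 2 * target elements: nothing to thin out, keep everything.
    return corpus;
  }

  // arbInt(factor) is uniform over [0, factor), so it equals 0 with
  // probability 1/factor; only those elements are kept.
  return { e | e <- corpus, arbInt(factor) == 0 };
}
```

Under these assumptions, a 10-element set with `target` 4 gives `factor` 2, so each element is kept with probability 1/2 and the expected size is 5 rather than exactly 4.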

#### Examples

``
rascal>import util::Sampling;
ok
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"a","b","c","f","g","i","k"}
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"c","g","h","i","k"}
rascal>sample({"a","b","c","e","f","g","h","i","j","k"}, 4)
set[str]: {"a","b","c","e","f","g","h"}
``

## function sample

Reduce the length of a list by selecting a uniformly distributed sample.

``list[&T] sample(list[&T] corpus, int target)``

Sampling preserves the relative order of the selected elements. A uniform sublist is computed by iterating over the list and keeping each element with a probability of `1/(size(corpus) / target)`; every other element is skipped. This quickly generates a new list whose expected length is `target`, although any individual result will most likely be a little shorter or longer.
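The order guarantee can be checked with a small helper such as the one below. It is only a sketch: the module name `OrderCheck` and the function `sampleKeepsOrder` are hypothetical and not part of `util::Sampling`.

```rascal
module OrderCheck

import util::Sampling;
import List; // sort

// Hypothetical check, not part of util::Sampling.
// [1..1000] is ascending, so an order-preserving sample of it
// must be equal to its own sorted version.
bool sampleKeepsOrder() {
  list[int] s = sample([1..1000], 30);
  return s == sort(s);
}
```

Because the input list is already ascending, the check should hold for every random draw, whatever length the sample turns out to have.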

#### Examples

``
rascal>import util::Sampling;
ok
rascal>sample([1..1000], 30)
list[int]: [53,91,108,132,147,194,226,233,236,280,286,287,292,296,302,484,506,563,581,605,609,616,620,648,715,755,764,773,787,864,865,875,890,929,938]
rascal>sample([1..1000], 30)
list[int]: [7,30,35,40,47,57,71,114,116,126,131,180,204,222,254,292,293,300,303,313,337,347,420,435,450,523,525,543,601,610,617,656,693,784,799,840,841,882,927]
rascal>sample([1..1000], 30)
list[int]: [37,59,62,79,142,204,207,293,308,329,334,356,361,367,372,409,451,510,527,568,686,689,868,915,940,949,951,985]
``

## function sample

Reduce the size of a map by selecting a uniformly distributed sample.

``map[&T,&U] sample(map[&T,&U] corpus, int target)``

A uniform submap is computed by iterating over the map's keys and keeping each key, together with its value, with a probability of `1/(size(corpus) / target)`; every other key is skipped. This quickly generates a new map whose expected size is `target`, although any individual result will most likely be a little smaller or larger.
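A usage sketch for the map variant is shown below. The module name `MapSampleExample`, the data in `wordCounts`, and both function names are made up for illustration. No output is shown because the result differs on every call.

```rascal
module MapSampleExample

import util::Sampling;

// Hypothetical example data; any map works.
map[str, int] wordCounts =
  ("a": 1, "b": 2, "c": 3, "d": 4, "e": 5,
   "f": 6, "g": 7, "h": 8, "i": 9, "j": 10);

// Draw a random sample aiming for about 4 entries.
map[str, int] sampleWordCounts() = sample(wordCounts, 4);

// Every sampled key keeps its original value, so the sample
// is a submap of the corpus (<= is the submap comparison on maps).
bool sampleIsSubmap() = sampleWordCounts() <= wordCounts;
```

Checking the submap property rather than a fixed result keeps the example valid for any random draw.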