18: random and probabilistic data structures

Linear Congruential Generator | Random Data Structure Overview | Treap
Fisher-Yates Shuffle | Random Assignment | Bloom Filter | Random Optimization

Linear Congruential Generator

When you use rand() in C/C++, they aren't really random numbers, it is a string of numbers that is based on a formula in the standard library.

A linear congruential generator is a popular random number generator:

X(n+1) = (a * X(n) + c) mod m
where:
  • a: 0 < a < m
  • c: 0 <= c < m
  • X(0): 0 < m
Requirements of c, a, and m:
  • c and m should be relatively prime (the only positive integer that evenly divides them both is one)
    • "For example, 14 and 15 are coprime, being commonly divisible by only 1, but 14 and 21 are not, because they are both divisible by 7." [2]
  • a - 1 must be divisible by all prime factors of m
  • a - 1 must be a multiple of 4 if m is a multiple of 4
glibc (used by gcc) uses these parameters:
  • m = 2^32
  • a = 1103515245
  • c = 12345
Visual C/C++ and Apple CarbonLib use a linear congruential generator too.

This means I should be able to perfectly predict the random numbers coming from a gcc rand function if I know the seed or the previous random number output.

In fact, in the "Berkley Open Infrastructure for Network Computing" (BOINC), they save the state of the random number generator by saving off the last random number generated. Then to restore, that number is put in as the seed.

Demonstration of rand predictibility:
1:  
// randReplay.cpp - download here

2:  

3:  
#include
 
<
cstdlib>

4:  
#include
 
<
iostream>

5:  

6:  
void
 
printRandomNumbers(
)

7:  
{

8:  
 
 
std:
:
cout 
<
<
 
"five random numbers between 0 and 10:"
 
<
<
 
std:
:
endl;

9:  
 
 
for
(
int
 
=
 
0;
 
<
 
5;
 
++i)
{

10: 
 
 
 
 
std:
:
cout 
<
<
 
rand(
)
 
10 
<
<
 
" "
;

11: 
 
 
}

12: 
 
 
std:
:
cout 
<
<
 
std:
:
endl;

13: 
}

14: 

15: 
int
 
main(
int
 
argc, 
char
 
argv[
]
)
{

16: 

17: 
 
 
int
 
seed 
=
 
20;

18: 

19: 
 
 
srand(
seed)
;

20: 
 
 
printRandomNumbers(
)
;

21: 

22: 
 
 
srand(
seed)
;

23: 
 
 
printRandomNumbers(
)
;

24: 

25: 
 
 
//prints:

26: 
 
 
//

27: 
 
 
//  five random numbers between 0 and 10:

28: 
 
 
//  1 8 7 9 6 

29: 
 
 
//

30: 
 
 
//  five random numbers between 0 and 10:

31: 
 
 
//  1 8 7 9 6 

32: 

33: 
 
 
return
 
0;

34: 
}

  • If better random numbers are needed, a Mersenne twister algorithm has a much larger period

Environment based random numbers

Linux and MAC OS X contain the special file: /dev/random. It collects environmental noise from device drivers and "other sources" and uses these as seeds to produce random bytes [4]. This can serve random numbers that are good enough for cryptographic functions. If /dev/random runs out of bytes and there is no new noise, reads from /dev/random block until there is new noise (on linux).

Hardware Random Number Generators

Hardware Random Number Generators use things like thermal noise, the photoelectric effect or other quantum phenomena [5] to produce random numbers.

Random Data Structure Overview

  • List: Skip List
  • Hashtable: Perfect Hashing
  • Tree: Treap

Treap

Treaps are a randomized binary tree. The are a combiniation of a max heap and a binary tree.

Insertion:
  • Start: Treap
  • Insert value with random priority: Treap
  • Do rotations until there is a max heap again: Treap
Time complexity:
  • O(lgn) amortized

Fisher-Yates Shuffle [7]

Fisher-Yates Shuffle can create a random permutation of a finite set:
1: 
algorithm
 
fisherYatesShuffle(
vector 
vec)

2: 
 
 
for
 
vec.size(
)
 
to 
1

3: 
 
 
 
 
random 
number 
between 
and 
i

4: 
 
 
 
 
exchange 
vec[
i]
 
and 
vec[
j]

start: [0 1 2 3 4 5]
i = 5, j = 2 [0 1 5 3 4 2]
i = 4, j = 0 [4 1 5 3 0 2]
i = 3, j = 2 [4 1 3 5 0 2]
i = 2, j = 1 [4 3 1 5 0 2]
i = 1, j = 0 [3 4 1 5 0 2]

Randomly Assign to Bins

You want each bin to have an equal number of random items:
1:  
// randProjectAssignment.cpp - download here

2:  

3:  
#include
 
<
cstdlib>

4:  
#include
 
<
iostream>

5:  
#include
 
<
vector>

6:  
#include
 
<
string>

7:  
#include
 
<
sstream>

8:  
#include
 
<
ctime>

9:  
#include
 
<
algorithm
>

10: 

11: 
std:
:
vector<
std:
:
vector<
std:
:
string>
 
>
 
assign(
int
 
num_people, 
std:
:
vector<
std:
:
string>
 
names)

12: 
{

13: 
 
 
std:
:
vector<
int
>
 
people;

14: 
 
 
for
(
int
 
=
 
0;
 
<
 
names.size(
)
;
 
++i)
{

15: 
 
 
 
 
people.push_back(
num_people)
;

16: 
 
 
}

17: 

18: 
 
 
std:
:
random_shuffle(
people.begin(
)
people.end(
)
)
;
 

19: 

20: 
 
 
std:
:
vector<
std:
:
vector<
std:
:
string>
 
>
 
ret;

21: 
 
 
ret.resize(
num_people)
;

22: 

23: 
 
 
for
(
int
 
=
 
0;
 
<
 
people.size(
)
;
 
++i)
{

24: 
 
 
 
 
int
 
id 
=
 
people[
i]
;

25: 
 
 
 
 
ret[
id]
.push_back(
names[
i]
)
;

26: 
 
 
}

27: 
 
 

28: 
 
 
return
 
ret;

29: 
}

30: 

31: 
std:
:
string 
toString(
int
 
i)

32: 
{

33: 
 
 
std:
:
ostringstream 
oss;

34: 
 
 
oss 
<
<
 
i;

35: 
 
 
return
 
oss.str(
)
;

36: 
}

37: 

38: 
int
 
main(
int
 
argc, 
char
 
argv[
]
)

39: 
{

40: 

41: 
 
 
std:
:
vector<
std:
:
string>
 
names;

42: 
 
 
for
(
int
 
=
 
0;
 
<
 
10;
 
++i)
{

43: 
 
 
 
 
names.push_back(
toString(
i)
)
;

44: 
 
 
}

45: 

46: 
 
 
std:
:
vector<
std:
:
vector<
std:
:
string>
 
>
 
groups 
=
 
assign(
3, 
names)
;

47: 
 
 
for
(
size_t
 
=
 
0;
 
<
 
groups.size(
)
;
 
++i)
{

48: 
 
 
 
 
std:
:
cout 
<
<
 
"group: "
 
<
<
 
<
<
 
std:
:
endl;

49: 
 
 
 
 
std:
:
vector<
std:
:
string>
 
curr 
=
 
groups[
i]
;

50: 
 
 
 
 
for
(
size_t
 
=
 
0;
 
<
 
curr.size(
)
;
 
++j)
{

51: 
 
 
 
 
 
 
std:
:
cout 
<
<
 
"  name: "
 
<
<
 
curr[
j]
 
<
<
 
std:
:
endl;

52: 
 
 
 
 
}

53: 
 
 
}

54: 

55: 
 
 
//group: 0

56: 
 
 
//  name: 1

57: 
 
 
//  name: 4

58: 
 
 
//  name: 8

59: 
 
 
//  name: 9

60: 
 
 
//group: 1

61: 
 
 
//  name: 0

62: 
 
 
//  name: 2

63: 
 
 
//  name: 7

64: 
 
 
//group: 2

65: 
 
 
//  name: 3

66: 
 
 
//  name: 5

67: 
 
 
//  name: 6

68: 

69: 
 
 
return
 
0;

70: 
}

Bloom Filter [8]

A bloom filter tells whether an element is in a set. It says either:
  • The item may be in the set
  • The item is not in the set
It uses k hash functions and an array of binary bits. 10 bits per element are needed to ensure 1% false positive probability [9].

To insert, feed the item to the k hash functions to get k bits to set in the table. Then set these bits. Removal is possible in a counting filter that contains a count rather than a bit.

The hash function could be:
  • hash_k(item) = (k + item) % table_size
  • Seed a random number generator with a hash result. Take the k successive items from rand()
To search, all bits must be one after hashing.

Notable uses:
  • Apache Casandra: reduce disk lookups for non-existant rows by containing all keys in a bloom filter [11]
  • Google Chrome: identify malicious URLs. first check is against bloom filter then full check is against full database.

Evolutionary Programming

In evolutionary programming, a model is fixed, but the constants are randomly mutated until a solution is found
1:  
algorithm
 
evolve

2:  
 
 
seeds 
generateRandomSeeds(
)
;

3:  
 
 
calculate 
cost;

4:  
 
 
while
(
convergence 
not 
met 
of 
seeds)

5:  
 
 
 
 
take 
the 
best 
seeds 
and 
add 
them 
to 
next_seeds;

6:  
 
 
 
 
mutate 
the 
best 
seeds 
and 
add 
them 
to 
next_seeds;

7:  
 
 
 
 
generate 
random 
new
 
seeds 
and 
add 
them 
to 
next_seeds;

8:  
 
 
 
 
seeds 
next_seeds;

9:  
 
 
 
 
calculate 
cost 
of 
seeds;

10: 
 
 
output 
constants 
for
 
best 
seed;

Line Parameter Estimation Example:
1:   
// evolutionaryLine.cpp - download here

2:   

3:   
#include
 
<
vector>

4:   
#include
 
<
iostream>

5:   
#include
 
<
algorithm
>

6:   
#include
 
<
cstdio>

7:   
#include
 
<
cstdlib>

8:   
#include
 
<
ctime>

9:   
#include
 
<
cmath>

10:  

11:  
struct
 
Seed 
{

12:  
 
 
int
 
m;

13:  
 
 
int
 
b;

14:  
 
 
int
 
cost;

15:  

16:  
 
 
bool
 
operator<
(
const
 
Seed&
 
rhs)
 
const
;

17:  
}
;

18:  

19:  
bool
 
Seed:
:
operator<
(
const
 
Seed&
 
rhs)
 
const

20:  
{

21:  
 
 
return
 
cost 
<
 
rhs.cost;

22:  
}

23:  

24:  
struct
 
Point 
{

25:  
 
 
int
 
x;

26:  
 
 
int
 
y;

27:  
 
 
Point(
int
 
x, 
int
 
y)
;

28:  
}
;

29:  

30:  
Point:
:
Point(
int
 
x, 
int
 
y)
{

31:  
 
 
this->
=
 
x;

32:  
 
 
this->
=
 
y;

33:  
}

34:  

35:  
class
 
EvolutionaryLine 
{

36:  
public
:

37:  
 
 
Seed 
solve(
std:
:
vector<
Point>
 
points)
;

38:  
private
:

39:  
 
 
void
 
fixSeeds(
)
;

40:  
 
 
void
 
initSeeds(
)
;

41:  
 
 
int
 
calculateScore(
Seed 
seed)
;

42:  
 
 
void
 
mutate(
)
;

43:  
 
 
void
 
calculateScores(
)
;

44:  
 
 
std:
:
vector<
Point>
 
m_Points;

45:  
 
 
std:
:
vector<
Seed>
 
m_Seeds;

46:  
 
 
int
 
m_ConvergeCount;

47:  
 
 
Seed 
m_BestSeed;

48:  
}
;

49:  

50:  
int
 
EvolutionaryLine:
:
calculateScore(
Seed 
seed)
{

51:  
 
 
int
 
ret 
=
 
0;

52:  
 
 
for
(
size_t
 
=
 
0;
 
<
 
m_Points.size(
)
;
 
++i)
{

53:  
 
 
 
 
Point 
curr 
=
 
m_Points[
i]
;

54:  
 
 
 
 
int
 
=
 
(
seed.m 
curr.x)
 
seed.b;

55:  
 
 
 
 
int
 
score 
=
 
curr.y 
y;

56:  
 
 
 
 
score 
*=
 
score;

57:  
 
 
 
 
ret 
+=
 
score;

58:  
 
 
}

59:  
 
 
return
 
ret;

60:  
}

61:  

62:  
Seed 
EvolutionaryLine:
:
solve(
std:
:
vector<
Point>
 
points)
{

63:  
 
 
m_Points 
=
 
points;

64:  
 
 
initSeeds(
)
;

65:  
 
 
while
(
m_BestSeed.cost 
>
 
1)
{

66:  
 
 
 
 
std:
:
cout 
<
<
 
"cost: "
 
<
<
 
m_BestSeed.cost 
<
<
 
std:
:
endl;

67:  
 
 
 
 
mutate(
)
;

68:  
 
 
 
 
calculateScores(
)
;

69:  
 
 
}

70:  
 
 
return
 
m_BestSeed;

71:  
}

72:  

73:  
void
 
EvolutionaryLine:
:
fixSeeds(
)
{

74:  
 
 
for
(
size_t
 
=
 
0;
 
<
 
m_Seeds.size(
)
;
 
++i)
{

75:  
 
 
 
 
Seed 
curr 
=
 
m_Seeds[
i]
;

76:  
 
 
 
 
while
(
curr.m 
=
=
 
0)
{

77:  
 
 
 
 
 
 
curr.m 
+=
 
(
rand(
)
 
4)
 
2;

78:  
 
 
 
 
}

79:  
 
 
 
 
m_Seeds[
i]
 
=
 
curr;

80:  
 
 
}

81:  
}

82:  

83:  
void
 
EvolutionaryLine:
:
mutate(
)
{

84:  
 
 
std:
:
vector<
Seed>
 
new_seeds;

85:  
 
 
new_seeds.push_back(
m_BestSeed)
;

86:  

87:  
 
 
for
(
int
 
=
 
0;
 
<
 
10;
 
++i)
{

88:  
 
 
 
 
Seed 
curr 
=
 
m_Seeds[
i]
;

89:  
 
 
 
 
curr.m 
+=
 
(
rand(
)
 
4)
 
2;

90:  
 
 
 
 
curr.b 
+=
 
(
rand(
)
 
4)
 
2;

91:  
 
 
 
 
new_seeds.push_back(
curr)
;

92:  
 
 
}

93:  

94:  
 
 
for
(
int
 
=
 
0;
 
<
 
10;
 
++i)
{

95:  
 
 
 
 
Seed 
seed;

96:  
 
 
 
 
seed.m 
=
 
(
rand(
)
 
10)
 
5;

97:  
 
 
 
 
seed.b 
=
 
(
rand(
)
 
10)
 
5;

98:  
 
 
 
 
new_seeds.push_back(
seed)
;

99:  
 
 
}

100: 

101: 
 
 
m_Seeds 
=
 
new_seeds;

102: 
 
 
fixSeeds(
)
;

103: 
}

104: 

105: 
void
 
EvolutionaryLine:
:
calculateScores(
)

106: 
{

107: 
 
 
for
(
size_t
 
=
 
0;
 
<
 
m_Seeds.size(
)
;
 
++i)
{

108: 
 
 
 
 
Seed 
curr 
=
 
m_Seeds[
i]
;

109: 
 
 
 
 
curr.cost 
=
 
calculateScore(
curr)
;

110: 
 
 
 
 
m_Seeds[
i]
 
=
 
curr;

111: 
 
 
}

112: 

113: 
 
 
sort(
m_Seeds.begin(
)
m_Seeds.end(
)
)
;

114: 
 
 
Seed 
new_best 
=
 
m_Seeds[
0]
;

115: 
 
 
if
(
new_best.m 
=
=
 
m_BestSeed.m 
&
&
 
new_best.b 
=
=
 
m_BestSeed.b)
{

116: 
 
 
 
 
m_ConvergeCount++;

117: 
 
 
}
 
else
 
{

118: 
 
 
 
 
m_ConvergeCount 
=
 
1;

119: 
 
 
 
 
m_BestSeed 
=
 
new_best;

120: 
 
 
}

121: 
}

122: 

123: 
void
 
EvolutionaryLine:
:
initSeeds(
)

124: 
{

125: 
 
 
srand(
time(
NULL)
)
;

126: 
 
 
for
(
int
 
=
 
0;
 
<
 
10;
 
++i)
{

127: 
 
 
 
 
Seed 
seed;

128: 
 
 
 
 
seed.m 
=
 
(
rand(
)
 
10)
 
5;

129: 
 
 
 
 
seed.b 
=
 
(
rand(
)
 
10)
 
5;

130: 
 
 
 
 
m_Seeds.push_back(
seed)
;

131: 
 
 
}

132: 
 
 
fixSeeds(
)
;

133: 
 
 
for
(
size_t
 
=
 
0;
 
<
 
m_Seeds.size(
)
;
 
++i)
{

134: 
 
 
 
 
Seed 
curr 
=
 
m_Seeds[
i]
;

135: 
 
 
 
 
curr.cost 
=
 
calculateScore(
curr)
;

136: 
 
 
 
 
m_Seeds[
i]
 
=
 
curr;

137: 
 
 
}

138: 
 
 
sort(
m_Seeds.begin(
)
m_Seeds.end(
)
)
;

139: 
 
 
m_BestSeed 
=
 
m_Seeds[
0]
;

140: 
 
 
m_ConvergeCount 
=
 
1;

141: 
}

142: 

143: 
int
 
main(
int
 
argc, 
char
 
argv[
]
)
{

144: 

145: 
 
 
std:
:
vector<
Point>
 
points;

146: 
 
 
points.push_back(
Point(
1, 
0)
)
;

147: 
 
 
points.push_back(
Point(
2, 
1)
)
;

148: 
 
 
points.push_back(
Point(
3, 
2)
)
;

149: 
 
 
points.push_back(
Point(
4, 
3)
)
;

150: 
 
 
points.push_back(
Point(
5, 
4)
)
;

151: 
 
 
points.push_back(
Point(
6, 
5)
)
;

152: 
 
 
points.push_back(
Point(
8, 
6)
)
;

153: 

154: 
 
 
EvolutionaryLine 
evoline;

155: 
 
 
Seed 
seed 
=
 
evoline.solve(
points)
;

156: 

157: 
 
 
std:
:
cout 
<
<
 
"seed.m: "
 
<
<
 
seed.m 
<
<
 
" seed.b: "
 
<
<
 
seed.b 
<
<
 
std:
:
endl;

158: 
 
 
//prints:

159: 
 
 
//  cost: 186

160: 
 
 
//  cost: 33

161: 
 
 
//  seed.m: 1 seed.b: -1

162: 

163: 
 
 
return
 
0;

164: 
}

References

  1. http://en.wikipedia.org/wiki/Linear_congruential_generator
  2. http://en.wikipedia.org/wiki/Relatively_prime
  3. http://en.wikipedia.org/wiki/Treap
  4. http://en.wikipedia.org/wiki//dev/random
  5. http://en.wikipedia.org/wiki/Hardware_random_number_generator
  6. http://en.wikipedia.org/wiki/Randomized_algorithm
  7. http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
  8. http://en.wikipedia.org/wiki/Bloom_filter
  9. Bonomi, Flavio; Mitzenmacher, Michael; Panigrahy, Rina; Singh, Sushil; Varghese, George (2006), "An Improved Construction for Counting Bloom Filters", Algorithms – ESA 2006, 14th Annual European Symposium, Lecture Notes in Computer Science 4168, pp. 684–695
  10. http://wiki.apache.org/cassandra/ArchitectureSSTable