Getting the minimum of a dynamic set
1. INITIAL FORMULATION
We are given a set S of integers that changes dynamically over
time, with the following operations on the set:
insert: An integer that is not a duplicate of any current
member is added to S.
delete: A current member of the set is deleted.
getmin: The value of the smallest integer in the set is returned.
There is some number N such that it is guaranteed that the size
of S will never be greater than N. Other than that the
probability of inserts and deletes is equal. It is desired that
the information about the set be organized so that:
(a) the amortized cost of inserts, deletes, and getmins is O(1)
2. ADDITIONAL ASSUMPTIONS
(1) The integers in S are machine representable, i.e., there is
an IMAX such that all members of S are less than IMAX.
(2) The integers inserted are all randomly drawn from the same
probability distribution. This is a stronger assumption than is
necessary, but it simplifies analysis.
(3) All members of the set are equally likely to be deleted.
This is a very important assumption; if, for example, the
smallest element is always deleted, we get a completely different
task.
(4) The size of S over time has a mean value n, n<N. Again,
this is a stronger assumption than necessary, but it simplifies
analysis.
(5) The process has a finite duration, bounded by TMAX inserts.
The duration is "long", i.e. TMAX >> N.
3. REFINEMENT OF REQUIREMENTS
(1) The worst case cost of an operation need not be O(1). If
this is not the case then the best that can be done is some
combination of O(log n) and O(1), e.g., insert and delete costs
are O(log n) and getmin cost is O(1).
(2) There is no significant likelihood that the worst case
will occur.
(3) The amortized cost of each operation should be minimized,
i.e., we are looking for a (nearly) optimal solution.
4. DATA STRUCTURES FOR S
There are at least three obvious methods for storing the set -
using a hash table, using a trie, and using a bit vector. Hash
tables are convenient, reasonably economical in storage usage,
and are easily programmed. The disadvantages are (a) they have a
bad worst case search cost, and (b) the cost of computing hash
indices can be significant. Tries (convert the integer into a
byte sequence) are also easily programmed and have a fixed cost
for inserts and deletes. However they tend to be relatively
expensive in terms of storage cost. Using a bit vector is the
simplest and most efficient method; however it can only be used
if IMAX is "reasonable". For example, if IMAX is 2**16 then the
bit vector would consist of 2048 32 bit words; if IMAX is 2**32,
then the bit vector would be a 2 Gigabyte monster. Which method
to use is an implementation issue.
5. THE NAIVE GETMIN ALGORITHM
The naive algorithm is to store the current minimum value. If a
value smaller than the current minimum is inserted then it
becomes the new current minimum value. If the minimum value is
deleted then the entire set is scanned to determine the minimum
of the remaining values. Evidently the getmin cost is O(1).
The cost of determining the minimum after a delete of the current
minimum is O(n). However this deletion only occurs in 1/n of the
deletes, so the amortized cost of deletes is O(1).
The naive getmin algorithm meets the initial requirements;
however it would be desirable if the rescanning could be either
eliminated or be made unlikely. There are two obvious
alternatives for reducing the likelihood of rescanning: using a
trie, and keeping a set of small values.
Finding the minimum when the values are stored in a trie is a
bounded cost. A trie may well be the best choice if guaranteed
bounded costs are required. It will not considered further here.
6. THE SUBSET OF SMALL VALUES ALGORITHM
The root idea here is to keep a subset S' of the smallest values
of S; to get the minimum of S we need only to find the minimum of
S'. There are several variations of this root idea; the simplest
is to keep at most M values for some fixed M.
As long as there are less than m values in our subset we
can insert elements into it. If the subset is full (has M
elements) then we discard the largest when there is an insertion
of a "small" value.
Conversely, if the subset becomes empty we scan the entire suite
of elements to find the new subset of M smallest values. The
thought is that it will be unlikely that we ever need to scan the
entire suite if m is large enough.
There remains the question of determining whether a newly
inserted value should be added to S'. The method considered here
is to record the maximum, S'_MAX, of S' when it is full (has M
items) and use it as a test value, only adding items to S' that
are smaller than S'_MAX. The algorithm then is:
Initialization of S': Initially insert and delete items into S'
until it becomes full (has M entries). Record S'_MAX.
Thereafter:
Inserts: Check whether the item is less the current S'_MAX. If
not, store it in general storage. If it is add it to S' and
increment the size, m, of S'. If m>M then we move the current
maximum of S' to general storage, and scan S' for a new S'_MAX.
Also, if the new item is less than the current minimum then it
replaces the current minimum.
Deletes: Check whether the item is in S'; if it is not, simply
delete it. If it is in S' and it is not the current minimum,
simply delete it from S' and decrement the size of S'. If it is
the current minimum and m, the size of S', is greater than 1,
delete it, decrement the size of S', and scan S' for the new
minimum. Finally, if m=1, i.e., we are deleting the last member
of S', then we scan all of S to get the m smallest entries and
use them to form a new S'.
Under the assumption that the distribution of inserted items
comes from a well defined probability distribution, there will be
some value M_TEST such that if there are n items in S, then there
will be M items in S' on average. S'_MAX is an estimate of
M_TEST.
7. ANALYSIS OF PERFORMANCE
The amortized costs of inserts, deletes, and getmins are all
O(1) with a worst case for deletes of O(n). There remains the
question of how likely it is for it to be necessary to rescan the
entire set, S, given a subset size, M.
Suppose that at a given moment in time there are k members in S',
k <= M. Suppose further that there are n members in S. Let t(k)
be the expected time (where time is measured as the number of
inserts and deletes) that S' is unchanged. During a course of 2n
events (n inserts, n deletes) we can expect M inserts and k
deletes. Then we have t(k) = 2n/(M+k).
Over time the number of elements in S' will go up or down until
it either overflows or underflows. Let T(k) be the total time
(on average) in which there are k members in S'. We have
T(k) = t(k) + (M/(M+k-1))*T(k-1) + ((k+1)/(M+k+1))*T(k+1)
(2 < k < M)
T(M) = t(M) + (M/(2M-1))*T(M-1)
T(1) = t(2) + (2/(M+2))*T(2)
The average lifetime of S' before an underflow or overflow is
SUM = sum(T(k), K=1...M). Most "failures" will be overflows
which have a cost of at most M. The average lifetime of S'
before an underflow is:
T_U = SUM*(T(M)/2 + T(1)/(M+1))/(T(1)/(M+1))
The recurrence equations are easily solved. The following table
lists SUM/n and T_U/n for various values of M.
M SUM/n T_U/n
2 1.39 3.55
3 2.77 9.48
4 4.56 22.24
5 6.60 48.02
6 8.77 96.70
7 10.97 181.75
8 13.16 318.50
9 15.36 521.29
10 17.58 801.59
15 28.97 3664.51
20 41.06 9875.70
25 53.76 20893.85
30 66.99 38266.69
40 94.74 98550.09
50 123.90 204157.76
60 154.21 369176.35
70 185.50 608220.03
80 217.65 936344.59
90 250.54 1368985.88
100 284.11 1921913.55
200 646.78 17718911.07
500 1893.83 326620455.63
1000 4225.63 2922059700.00
If we desire M to be large enough so that it is unlikely that
there be an underflow then we need to pick it so that T_U > TMAX.
Given the rate of growth of T_U as a function of M, it doesn't
need to be much larger.
This page was last updated January 1, 2004. |