implementation of set operations

March 13, 2013 - 5 minutes read - 933 words

We got in a bit of a debate yesterday in the office over the implementation of associative containers, which I thought was pretty fun. We made up the big chart of complexity results you see below.

nomenclature:

$S$, $S_1$, and $S_2$ are subsets of $\Omega$.
Denote an element by $e\in\Omega$.
$n$,$n_1$,$n_2$,$N$ are the sizes of the set $S$, $S_1$, $S_2$, and $\Omega$, respectively, and $n_1 \geq n_2$.

Complexity

Operation\Approach	Hash Table	Hash Tree	Binary List	Entry List (sorted)	Entry List (unsorted)
$e \in S $	$O(1) $	$O(log(n)) $	$O(1) $	$O(log(n)) $	$O(n) $
$S_1 \cup S_2 $	$O(n_1+n_2) $	$O(n_1+n_2) $	$O(N) $	$O(n_1+n_2) $	$O(n_1n_2) $
$S_1 \cap S_2 $	$O(n_1) $	$O(log(n_1)n_2)$	$O(N) $	$O(n_2) $	$O(n_1n_2) $
space complexity	$O(n) $	$O(n) $	$O(N)$ bits.	$O(n) $	$O(n) $

As I said–this was just what came out of my memory of an informal discussion, so I make no guarantees that any of it is correct. Let me know if you spot something wrong! We used the examples $S_1 = {1,2,3,4,5}$ and $S_2 = {500000}$ to think through some things.

Details on Implementations

Hash Table

Assumptions: hash collisions rare/table lookups constant time.
contains: hash the element $e$, table lookup, if 1 then in set.
union: create new table $ht$, set $ht[e]=1$ for all $e$ in either set.
intersection: create new table $ht$, then for each $e\in S_2$, if $e\in S_1$ set $ht[e] = 1$.
space: Actual space used and table lookup complexity is highly dependent on the hash function : $h(x)=x$ requires $N$ bits for the table but collisions are impossible(identical to bitset); $h(x)=1$ requires only 1 bit for the table but necessitates a separate list of colliding elements(identical to sorted/unsorted list, depending on how collisions are handled.)

Hash Tree

Assumptions: hash collisions rare/tree lookups logarithmic time(well balanced trees).
contains: hash elt, tree lookup, if 1 then in set.
union: must iterate over both sets (eg, tree traversal)
intersection: lookup for each elt in n_2.
space: Each element requires storage, plus the overhead of the tree.
Implementations details: To make average runtime more likely and depending on size of set, can use an AVL, Red Black, Splay, or B-Trees instead of simple binary search trees.

Binary List ($l_e = {0,1}$)

contains: access $e^{th}$ element of vector/bitset.
union: must iterate over all items which could be in the set.
intersection: same.
space: Each element in the set $\Omega$ requires one bit.

Element List ($l_i = e$, sorted)

contains: binary search on list for $e$.
union: pointer for each list, step until both point to end.
intersection: pointer for each list, step the pointer indicating the smaller value until end of shorter list(the remaining elements in the longer list can not be in both lists.)
space: Each element in the set must be stored.

Element List ($l_i = e$, unsorted)

contains: must iterate over entire list since elt could be anywhere.
union: for each elt of n_1, will iterate over half of n_2 on average
intersection: for each elt of n_1, will iterate over half of n_2 on average.
space: Each element in the set must be stored.

Tree notes

B-Trees vs Red/Black vs AVL vs Splay:
each try to minimize the depth of a tree:
B-Trees by breaking the $\leq 2$ children per node and doing more comparisons at each level,
AVL/Splay trees by keeping the tree as balanced as possible.
B-Trees can get a performance edge from cache locality (good for disk reads, ENORMOUS amounts of data, etc)
AVL/RedBlack trees try to self-balance to minimize depth.
AVL rebalances (an $O(log n)$ operation) when a the heights of subtrees of a node differ by more than 1;
RedBlack trees rebalance (an $O(1)$ operation) when paths to leaves contain different numbers of black nodes.
AVL requires longest and shortest depth of tree to be within one(and has $O(log(n))$ work to rebalance) while RedBlack trees requires longest and shortest depth to be <=2 and has O(1) work to rebalance.
Splay trees try to to ‘splay’ commonly accessed nodes to more shallow parts of the tree so that subsequent accesses require less tree traversal.

Practical Use

In closing, I wanted to point out a few ways these implementations come up in C++ and MATLAB.

In C++, most of these options are supported. In particular:

Hash tables are a typical implementation of unordered_set
Hash trees are a typical implementation of set (also see the example on wikipedia.)
Binary lists have a space efficient implementation in bitset
Support for element-list style set specification is included in STL’s algorithm function collection under set_difference, set_intersection, set_union, and find.

MATLAB does support list-style set operations via the ismember, union, and intersect commands(as well as setdiff):

S1 = [3,2,1]			% declare list
ismember(1,S1)			% true
S2 = 2					% 1-elt lists are lists.
S3 = setdiff(S1,S2);	% remove elements of S2 from S1
S4 = union(S2,S3)		% recombine S2 and S3 (ie, S1)
S5 = union(S4,3)		% also works for elements.
S6 = intersection(S1,S2)% common elements

Interestingly, MATLAB returns sorted arrays regardless of whether they are sorted or not.

Sets can also be manipulated in either the binary list or element list formats, because MATLAB supports indexing matrices by either a binary vector or a vector of indices.

vec = 10:20;
oe1=mod(vec,2)==1		% logical array of odd elements
oe2=find(mod(vec,2)==1)	% index array of odd elements.
vec(oe1)				% slicing by logical
vec(oe2)				% slicing by index

While researching this blog entry, I was also surprised to learn that MATLAB has support for maps.

k={'Bid', 'Ask', 'Open', 'Close'};
v={10.10,10.15, 9.90, 9.95};
stock=containers.Map(k, v);
stock.keys
stock.values
stock.isKey('Close')
stock('Close')=3
stock('Close')
stock.remove('Close')

Pretty neat!