Set (Set ADT) — Junior Level¶
Table of Contents¶
- Introduction
- What is a Set?
- ADT vs Implementation
- Real-World Analogies
- Core Concepts
- Distinct Elements
- No Defined Order
- Membership Test
- Equality and Hashing
- Core Operations
- Set-Algebra Operations
- Big-O Summary
- Pros and Cons
- Full Implementation — HashSet
- Worked Examples
- Common Mistakes
- Cheat Sheet
- Visual Animation
- Summary
Introduction¶
A set is one of the oldest ideas in mathematics and one of the most useful tools in everyday programming. It is a collection of distinct elements with no inherent order. If you have ever written if user_id in banned_users, removed duplicates from a list, or computed the intersection of two collections, you have used a set.
This topic introduces the Set Abstract Data Type (ADT) — the contract — and three concrete implementations: HashSet, TreeSet, and BitSet. We focus on what a set is, not just how one particular library represents one. By the end of this page you will be able to choose the right set implementation for the job and write one yourself.
We deliberately keep hash-table internals (chaining, probing, hash-function design) out of scope here — those live in topic 05-hash-tables. A HashSet simply uses a hash table; we treat it as a building block.
What is a Set?¶
Formally, a set S over some universe U is a subset of U where each element appears at most once. Sets answer two questions extremely well:
- Membership: "Is
xinS?" - Uniqueness: "Have I seen
xbefore?"
S = { 3, 7, 1, 9 } ← order does not matter
{1, 3, 7, 9} == {3, 7, 1, 9} ← these are the same set
{1, 1, 2, 3} == {1, 2, 3} ← duplicates collapse
A set has no notion of position. You cannot ask for "element 0" or "the third element"; you can only ask whether something is a member, iterate over all members, or combine sets with set-algebra operations like union and intersection.
ADT vs Implementation¶
This is the single most important distinction in this topic.
| Layer | Concern | Example |
|---|---|---|
| ADT (interface) | What operations exist and what they mean | add, remove, contains, size, union |
| Implementation | How operations are realized in memory | HashSet, TreeSet, BitSet, LinkedHashSet |
The Set ADT promises: - add(x): insert x; no effect if already present. - remove(x): delete x; no effect if absent. - contains(x): returns true iff x is currently a member. - size(): number of distinct elements. - iterate(): visit every member exactly once, in some order.
Three different implementations all satisfy this contract but trade off speed, memory, and ordering differently:
| Implementation | Backing structure | Membership | Iteration order | Best when |
|---|---|---|---|---|
| HashSet | Hash table | O(1) average | Undefined | General-purpose, big data |
| TreeSet | Balanced BST | O(log n) | Sorted | Need sorted output or range queries |
| BitSet | Bit vector | O(1) | Index order | Small dense integer universe |
| LinkedHashSet | Hash table + linked list | O(1) average | Insertion order | Need predictable iteration |
You should always reach for the ADT first ("I need a set") and pick the implementation last ("this set should preserve insertion order, so LinkedHashSet").
Real-World Analogies¶
1. Guest List for a Party¶
A guest list tells you whether someone is invited (membership) but does not order anyone. Inviting Alice twice does not put her on the list twice. Two guest lists can be combined (union for a joint party) or intersected (who is invited to both parties).
2. The Set of Books You Own¶
You either own a particular book or you do not. Owning a book "twice" is meaningless for the question "do I own this book?". Two collectors can ask which books they both own (intersection) or which books one has that the other does not (difference).
3. Tags on a Blog Post¶
A post is tagged ["python", "performance", "python"] — the duplicate python is redundant. The effective tag set is {python, performance}. Searching for posts tagged python is a membership query.
4. Visited Locations on a Map¶
While exploring a maze, the squares you have already visited form a set. Order does not matter — you only need to ask "have I been here?". Without this set, graph search algorithms (BFS, DFS) loop forever on cycles.
5. Bouncer at a Nightclub¶
The bouncer holds a banned-list. When someone tries to enter, the bouncer checks the list: present means rejected. They do not care how many times you have shown up before; you are either banned or you are not.
Where analogies break: real-world lists often do care about order or duplicates. The mathematical set is a strict abstraction — push that abstraction onto the data and the right code falls out.
Core Concepts¶
Distinct Elements¶
A set never stores two equal elements. Adding an element that is already present is a no-op (the set is unchanged, though some APIs return false to indicate "already there"). This is the uniqueness invariant.
No Defined Order¶
The Set ADT promises iteration, but does not promise an order. Two important pitfalls:
- A
HashSet's iteration order is unspecified and may change between versions of the language, JVM, runtime, or even between runs (Go intentionally randomizes map iteration). - A
TreeSet's iteration order is defined — sorted by the comparator — but only because the implementation goes further than the ADT requires.
If your code depends on iteration order, write that requirement down and pick a TreeSet or LinkedHashSet explicitly.
Membership Test¶
contains(x) is the central operation. Most other set operations are built on top of it. For example, computing A ∩ B reduces to iterating one set and testing membership in the other.
Equality and Hashing¶
How does the set decide that two elements are "the same"? Through two related contracts:
- Equality: two elements are equal iff they compare as equal under the language's equality operator (
==in Go for primitives,.equals()in Java,__eq__in Python). - Hashing: two equal elements must produce the same hash code (Java/Python). If
a.equals(b)thenhash(a) == hash(b), but the converse need not hold.
Break either contract and the set silently misbehaves: elements get lost, duplicates appear, or contains returns false for elements you just inserted. This is the single biggest source of bugs in set code.
// Java: bad — hashCode() uses field x, equals() uses fields x AND y.
// Two objects with same x but different y compare unequal yet hash equal.
// HashSet stores both, but contains() of one finds the other. Chaos.
Core Operations¶
| Operation | Description |
|---|---|
add(x) | Insert x. No effect if already present. |
remove(x) | Delete x. No effect if absent. |
contains(x) | True iff x is currently a member. |
size() | Number of distinct elements. |
isEmpty() | Equivalent to size() == 0. |
clear() | Remove all elements. |
iterate() | Visit each member once. |
equals(other) | True iff the two sets have exactly the same members. |
isSubset(other) | True iff every member of self is in other. |
isDisjoint(other) | True iff no member of self is in other. |
Set-Algebra Operations¶
Sets combine to form new sets through algebra borrowed straight from mathematics.
| Operation | Symbol | Meaning |
|---|---|---|
| Union | A ∪ B | members of A or B (or both) |
| Intersection | A ∩ B | members of both A and B |
| Difference | A \ B | members of A that are not in B |
| Symmetric difference | A △ B | members of exactly one of A or B |
| Cartesian product | A × B | set of ordered pairs (a, b) |
A worked example:
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
A ∪ B = {1, 2, 3, 4, 5, 6}
A ∩ B = {3, 4}
A \ B = {1, 2}
A △ B = {1, 2, 5, 6}
A × B = {(1,3),(1,4),(1,5),(1,6),(2,3),...,(4,6)} ← 16 pairs
These mirror SQL's UNION, INTERSECT, and EXCEPT — relational databases are themselves built around set algebra.
Big-O Summary¶
For a HashSet (assuming a good hash function and load factor below ~0.75):
| Operation | Average | Worst |
|---|---|---|
add(x) | O(1) | O(n) |
remove(x) | O(1) | O(n) |
contains(x) | O(1) | O(n) |
size() | O(1) | O(1) |
iterate() | O(n) | O(n) |
A ∪ B | O( | A |
A ∩ B | O(min( | A |
A \ B | O( | A |
A △ B | O( | A |
For union/intersection, the trick is always to iterate the smaller set and probe the larger one — that is what gives O(min(|A|, |B|)) for intersection on average.
For a TreeSet, replace average and worst with O(log n) per operation, and add O(n) for iteration in sorted order. For a BitSet over universe size U, each operation is O(U/64) — usually treated as O(1) for fixed-size universes.
Pros and Cons¶
| Pros | Cons |
|---|---|
| O(1) average membership tests | No ordering (unless you choose TreeSet / LinkedHashSet) |
| Automatic deduplication | Cannot index by position |
| Direct mapping to set algebra | Elements must be hashable (HashSet) or comparable (TreeSet) |
| Composes cleanly with other sets | Mutable elements break the contract silently |
Use a set when: the question is "have I seen this?", "is this unique?", "what do these collections have in common?", or "what is the union/intersection?".
Do not use a set when: you need ordering by insertion, positional access, duplicates that count (use a multiset / bag — see topic 07-multiset-bag), or key-value pairs (use a map — see topic 08-map-dictionary).
Full Implementation — HashSet¶
Below is a from-scratch HashSet that delegates collision handling to a built-in hash table (map, HashMap, dict). The point is to study the set layer, not to re-implement the hash table.
Go Implementation¶
package set
import "fmt"
// HashSet is a generic-by-string set built on Go's built-in map.
// We use struct{} as the value type because it occupies zero bytes.
type HashSet struct {
data map[string]struct{}
}
// New returns an empty HashSet.
func New() *HashSet {
return &HashSet{data: make(map[string]struct{})}
}
// Add inserts x. Returns true if x was newly added, false if already present.
func (s *HashSet) Add(x string) bool {
if _, ok := s.data[x]; ok {
return false
}
s.data[x] = struct{}{}
return true
}
// Remove deletes x. Returns true if x was present.
func (s *HashSet) Remove(x string) bool {
if _, ok := s.data[x]; !ok {
return false
}
delete(s.data, x)
return true
}
// Contains reports whether x is a member.
func (s *HashSet) Contains(x string) bool {
_, ok := s.data[x]
return ok
}
// Size returns the number of elements.
func (s *HashSet) Size() int { return len(s.data) }
// Clear removes all elements.
func (s *HashSet) Clear() { s.data = make(map[string]struct{}) }
// ForEach calls f for each member. Iteration order is intentionally random.
func (s *HashSet) ForEach(f func(string)) {
for k := range s.data {
f(k)
}
}
// Union returns a new set containing all members of s or other.
func (s *HashSet) Union(other *HashSet) *HashSet {
out := New()
s.ForEach(func(x string) { out.Add(x) })
other.ForEach(func(x string) { out.Add(x) })
return out
}
// Intersect returns a new set containing members in both s and other.
// We iterate the smaller set for efficiency.
func (s *HashSet) Intersect(other *HashSet) *HashSet {
small, large := s, other
if other.Size() < s.Size() {
small, large = other, s
}
out := New()
small.ForEach(func(x string) {
if large.Contains(x) {
out.Add(x)
}
})
return out
}
// Difference returns s \ other.
func (s *HashSet) Difference(other *HashSet) *HashSet {
out := New()
s.ForEach(func(x string) {
if !other.Contains(x) {
out.Add(x)
}
})
return out
}
// SymmetricDifference returns (s \ other) ∪ (other \ s).
func (s *HashSet) SymmetricDifference(other *HashSet) *HashSet {
out := s.Difference(other)
other.Difference(s).ForEach(func(x string) { out.Add(x) })
return out
}
// Equals reports whether s and other have exactly the same members.
func (s *HashSet) Equals(other *HashSet) bool {
if s.Size() != other.Size() {
return false
}
for k := range s.data {
if !other.Contains(k) {
return false
}
}
return true
}
// IsSubset reports whether every member of s is in other.
func (s *HashSet) IsSubset(other *HashSet) bool {
for k := range s.data {
if !other.Contains(k) {
return false
}
}
return true
}
// IsDisjoint reports whether s and other share no members.
func (s *HashSet) IsDisjoint(other *HashSet) bool {
small, large := s, other
if other.Size() < s.Size() {
small, large = other, s
}
for k := range small.data {
if large.Contains(k) {
return false
}
}
return true
}
// String renders the set for debugging.
func (s *HashSet) String() string {
return fmt.Sprintf("Set%v", s.data)
}
Java Implementation¶
import java.util.HashMap;
import java.util.Iterator;
import java.util.function.Consumer;
/**
* A minimal HashSet built on top of Java's HashMap.
* Demonstrates the Set ADT contract — the underlying hash table
* is treated as a black box.
*/
public class MyHashSet<E> implements Iterable<E> {
private static final Object PRESENT = new Object();
private final HashMap<E, Object> data = new HashMap<>();
public boolean add(E x) {
return data.put(x, PRESENT) == null;
}
public boolean remove(E x) {
return data.remove(x) != null;
}
public boolean contains(E x) {
return data.containsKey(x);
}
public int size() { return data.size(); }
public boolean isEmpty() { return data.isEmpty(); }
public void clear() { data.clear(); }
@Override
public Iterator<E> iterator() { return data.keySet().iterator(); }
public void forEach(Consumer<E> f) { data.keySet().forEach(f); }
public MyHashSet<E> union(MyHashSet<E> other) {
MyHashSet<E> out = new MyHashSet<>();
this.forEach(out::add);
other.forEach(out::add);
return out;
}
public MyHashSet<E> intersect(MyHashSet<E> other) {
MyHashSet<E> small = this, large = other;
if (other.size() < this.size()) { small = other; large = this; }
MyHashSet<E> out = new MyHashSet<>();
for (E x : small) if (large.contains(x)) out.add(x);
return out;
}
public MyHashSet<E> difference(MyHashSet<E> other) {
MyHashSet<E> out = new MyHashSet<>();
for (E x : this) if (!other.contains(x)) out.add(x);
return out;
}
public MyHashSet<E> symmetricDifference(MyHashSet<E> other) {
MyHashSet<E> out = this.difference(other);
for (E x : other.difference(this)) out.add(x);
return out;
}
public boolean equalsSet(MyHashSet<E> other) {
if (this.size() != other.size()) return false;
for (E x : this) if (!other.contains(x)) return false;
return true;
}
public boolean isSubset(MyHashSet<E> other) {
for (E x : this) if (!other.contains(x)) return false;
return true;
}
public boolean isDisjoint(MyHashSet<E> other) {
MyHashSet<E> small = this, large = other;
if (other.size() < this.size()) { small = other; large = this; }
for (E x : small) if (large.contains(x)) return false;
return true;
}
@Override
public String toString() { return "Set" + data.keySet(); }
}
Python Implementation¶
"""A minimal HashSet built on top of Python's dict."""
from typing import Generic, Iterable, Iterator, TypeVar
T = TypeVar("T")
class MyHashSet(Generic[T]):
"""Set ADT backed by a dict. dict keys already guarantee uniqueness."""
__slots__ = ("_data",)
def __init__(self, iterable: Iterable[T] | None = None) -> None:
self._data: dict[T, None] = {}
if iterable is not None:
for x in iterable:
self._data[x] = None
# --- core operations ---
def add(self, x: T) -> bool:
"""Insert x. Return True if newly added, False if already present."""
if x in self._data:
return False
self._data[x] = None
return True
def remove(self, x: T) -> bool:
"""Delete x. Return True if it was present."""
return self._data.pop(x, _MISSING) is not _MISSING
def contains(self, x: T) -> bool:
return x in self._data
def __contains__(self, x: T) -> bool:
return x in self._data
def __len__(self) -> int:
return len(self._data)
def __iter__(self) -> Iterator[T]:
return iter(self._data)
def clear(self) -> None:
self._data.clear()
# --- set algebra ---
def union(self, other: "MyHashSet[T]") -> "MyHashSet[T]":
out = MyHashSet[T]()
out._data.update(self._data)
out._data.update(other._data)
return out
def intersect(self, other: "MyHashSet[T]") -> "MyHashSet[T]":
small, large = (self, other) if len(self) <= len(other) else (other, self)
out = MyHashSet[T]()
for x in small:
if x in large:
out.add(x)
return out
def difference(self, other: "MyHashSet[T]") -> "MyHashSet[T]":
out = MyHashSet[T]()
for x in self:
if x not in other:
out.add(x)
return out
def symmetric_difference(self, other: "MyHashSet[T]") -> "MyHashSet[T]":
out = self.difference(other)
for x in other.difference(self):
out.add(x)
return out
# --- predicates ---
def equals(self, other: "MyHashSet[T]") -> bool:
return len(self) == len(other) and all(x in other for x in self)
def is_subset(self, other: "MyHashSet[T]") -> bool:
return all(x in other for x in self)
def is_disjoint(self, other: "MyHashSet[T]") -> bool:
small, large = (self, other) if len(self) <= len(other) else (other, self)
return all(x not in large for x in small)
def __repr__(self) -> str:
return "Set{" + ", ".join(repr(x) for x in self._data) + "}"
_MISSING = object()
Worked Examples¶
Example: Deduplicate a list¶
This is one line in every modern language:
Example: Common interests between two users¶
alice_likes = {"hiking", "python", "coffee", "jazz"}
bob_likes = {"coffee", "tea", "python", "running"}
common = alice_likes ∩ bob_likes = {"coffee", "python"}
Example: BFS visited-set¶
visited = ∅
queue = [start]
while queue:
v = queue.pop()
if v in visited: continue
visited.add(v)
for n in neighbors(v):
if n not in visited: queue.push(n)
Without the visited set, the algorithm loops forever on any cyclic graph.
Common Mistakes¶
| Mistake | Why it goes wrong |
|---|---|
| Storing mutable objects whose fields participate in hashing | After mutation, contains cannot find the object — it now hashes to a different bucket. |
Implementing equals without hashCode (or vice versa) | Two equal objects with different hashes can both live in the set; contains finds neither reliably. |
| Assuming iteration order is stable | HashSet in Go, Java, and Python may return elements in different orders on different runs. |
Using == (reference equality) when equals is needed | Two new Integer(5) references are not ==, but they should be set-equal. |
Forgetting that add(x) is a no-op if x is already present | Sets do not bump a counter — use a multiset for that. |
Storing NaN in a HashSet | NaN != NaN, so the same NaN value may insert twice (Java/Python behave differently). |
Including null/nil accidentally | Some implementations forbid it (Java TreeSet throws); others permit one entry. |
| Computing intersection by iterating the larger set | Twice the work; always iterate the smaller side. |
Cheat Sheet¶
| Need | Pick |
|---|---|
| Fastest membership, no ordering needed | HashSet |
| Sorted iteration / range queries | TreeSet |
| Tight memory over a dense integer range | BitSet |
| Predictable insertion-order iteration | LinkedHashSet |
| Distributed / approximate membership | Bloom filter, HyperLogLog, Redis SET |
| Element counts matter | Multiset / Bag (see topic 07) |
| Operation | HashSet | TreeSet | BitSet |
|---|---|---|---|
add | O(1) avg | O(log n) | O(1) |
remove | O(1) avg | O(log n) | O(1) |
contains | O(1) avg | O(log n) | O(1) |
| iteration | O(n) unordered | O(n) sorted | O(U) |
| memory | O(n) | O(n) | O(U/8 bytes) |
Visual Animation¶
See
animation.htmlfor an interactive visualization of two side-by-side sets withAdd,Remove,Containscontrols andUnion,Intersect,Difference,Symmetric Differencebuttons that compute a result set.
The animation demonstrates: - Two sets A and B rendered as element pills. - Live size statistics and the last operation performed. - A Big-O table for set-algebra operations. - An operation log that records every action.
Summary¶
A set is a collection of distinct elements with no defined order. Treat it as an ADT first — add, remove, contains, plus the algebra of union, intersection, and difference — and pick an implementation second. HashSet is the default; reach for TreeSet when you need ordering and BitSet when the universe is small and dense. Master the equals/hashCode contract and iteration-order pitfalls early, and the rest of your career will be set up for clean set code.