Sets — Interview Questions¶
Junior Level (5-7 Questions)¶
Q1: What is a set in Python and how does it differ from a list?¶
Answer
A set is an **unordered collection of unique, hashable elements**. Key differences from lists: | Feature | Set | List | |---------|-----|------| | Order | Unordered | Ordered (preserves insertion order) | | Duplicates | Not allowed | Allowed | | Indexing | Not supported | Supported (`list[0]`) | | Lookup | O(1) average | O(n) | | Elements | Must be hashable | Any type | | Syntax | `{1, 2, 3}` | `[1, 2, 3]` |Q2: How do you create an empty set? Why can't you use {}?¶
Answer
Use `set()` to create an empty set. `{}` creates an empty **dictionary**, not a set, because dict literals came first in Python's history.Q3: What is the difference between remove() and discard()?¶
Answer
Both remove an element from a set, but they differ in error handling: - **`remove(x)`** — raises `KeyError` if `x` is not in the set - **`discard(x)`** — does nothing if `x` is not in the set (no error) **Best practice:** Use `discard()` when you are not sure the element exists.Q4: How do you remove duplicates from a list?¶
Answer
**Simple (does not preserve order):** **Preserving order:** **Python 3.7+ (using dict):**Q5: What are the four main set operations?¶
Answer
| Operation | Operator | Method | Result | |-----------|----------|--------|--------| | Union | `a \| b` | `a.union(b)` | All elements from both sets | | Intersection | `a & b` | `a.intersection(b)` | Elements common to both | | Difference | `a - b` | `a.difference(b)` | Elements in `a` but not `b` | | Symmetric Difference | `a ^ b` | `a.symmetric_difference(b)` | Elements in either but not both |Q6: What is a frozenset and when would you use it?¶
Answer
A `frozenset` is an **immutable** version of a set. It cannot be modified after creation. **Use cases:** - As a dictionary key (regular sets are unhashable) - As an element of another set - For constant/config values that should not changeQ7: What does this code output?¶
Answer
Output: `True == 1 == 1.0` and `False == 0`, so they are considered the same values. The set keeps the first encountered representation.Middle Level (4-6 Questions)¶
Q1: Why must set elements be hashable? What makes an object hashable?¶
Answer
Sets use **hash tables** internally. Each element's hash determines its storage slot. For this to work: 1. The object must implement `__hash__()` — returns an integer 2. The object must implement `__eq__()` — for collision resolution 3. The **hash/eq contract** must hold: `if a == b then hash(a) == hash(b)` 4. The hash must not change during the object's lifetime **Hashable:** `int`, `str`, `float`, `tuple` (if all elements are hashable), `frozenset` **Unhashable:** `list`, `dict`, `set` (they are mutable)Q2: What is the time complexity of set operations?¶
Answer
| Operation | Average | Worst Case | |-----------|---------|------------| | `x in s` | O(1) | O(n) | | `s.add(x)` | O(1) | O(n) (resize) | | `s.remove(x)` | O(1) | O(n) | | `s \| t` (union) | O(len(s) + len(t)) | O(len(s) * len(t)) | | `s & t` (intersection) | O(min(len(s), len(t))) | O(len(s) * len(t)) | | `s - t` (difference) | O(len(s)) | O(len(s) * len(t)) | | `s <= t` (subset) | O(len(s)) | O(len(s) * len(t)) | Worst case occurs with pathological hash collisions (all elements in the same bucket).Q3: Explain the difference between set operators and set methods.¶
Answer
**Operators** (`|`, `&`, `-`, `^`) require **both operands to be sets**. **Methods** (`.union()`, `.intersection()`, etc.) accept **any iterable**.s = {1, 2, 3}
# Operator — TypeError with non-set
# s | [4, 5] # TypeError: unsupported operand type(s)
# Method — works with any iterable
s.union([4, 5]) # OK
s.intersection(range(3)) # OK
s.difference("abc") # OK (iterates chars)
# Methods also accept multiple arguments
s.union([4], [5], [6]) # {1, 2, 3, 4, 5, 6}
Q4: How do sets handle hash collisions?¶
Answer
CPython uses **open addressing** with perturbation-based probing: 1. Compute `slot = hash(element) & mask` 2. If slot is occupied by a different element (collision), compute next slot: 3. Repeat until an empty slot is found This avoids clustering issues of pure linear probing while being cache-friendly.# All elements hash to slot 0 — forces collision chain
class BadHash:
def __init__(self, val):
self.val = val
def __hash__(self):
return 0 # All same hash!
def __eq__(self, other):
return isinstance(other, BadHash) and self.val == other.val
# Lookup degrades to O(n)
import timeit
s = {BadHash(i) for i in range(1000)}
print(timeit.timeit(lambda: BadHash(999) in s, number=10000))
Q5: When would you use a set vs a dict for membership testing?¶
Answer
| Use Case | Choose | Why | |----------|--------|-----| | Simple membership check | `set` | Lower memory, simpler code | | Membership + associated data | `dict` | Store key-value pairs | | Need ordered unique elements | `dict.fromkeys()` | Dicts preserve insertion order (3.7+) | | Need counting | `collections.Counter` | Dict subclass for counting |Q6: What happens internally when a set resizes?¶
Answer
When the load factor exceeds ~2/3, CPython: 1. Allocates a new table (typically 2x-4x the current size, always power of 2) 2. Rehashes ALL existing elements into the new table 3. Frees the old table (unless it was the inline `smalltable`) This is an O(n) operation, but happens infrequently enough that amortized cost is O(1).Senior Level (4-6 Questions)¶
Q1: How would you implement a thread-safe set in Python?¶
Answer
While individual set operations are atomic under the GIL, compound operations (check-then-act) are not. A proper thread-safe set requires explicit locking:import threading
from typing import TypeVar, Generic
T = TypeVar("T")
class ThreadSafeSet(Generic[T]):
def __init__(self):
self._data: set[T] = set()
self._lock = threading.RLock()
def add(self, item: T) -> bool:
"""Add item. Returns True if it was actually added."""
with self._lock:
size_before = len(self._data)
self._data.add(item)
return len(self._data) > size_before
def add_if_absent(self, item: T) -> bool:
"""Atomic check-and-add. Returns True if added."""
with self._lock:
if item not in self._data:
self._data.add(item)
return True
return False
def snapshot(self) -> frozenset[T]:
with self._lock:
return frozenset(self._data)
Q2: Design a caching system using sets for cache invalidation tags.¶
Answer
from __future__ import annotations
from typing import Any
from collections import defaultdict
import time
class TagBasedCache:
"""Cache with set-based tag invalidation."""
def __init__(self):
self._cache: dict[str, tuple[Any, frozenset[str], float]] = {}
self._tag_to_keys: dict[str, set[str]] = defaultdict(set)
def set(self, key: str, value: Any, tags: set[str], ttl: float = 300) -> None:
frozen_tags = frozenset(tags)
self._cache[key] = (value, frozen_tags, time.monotonic() + ttl)
for tag in tags:
self._tag_to_keys[tag].add(key)
def get(self, key: str) -> Any | None:
entry = self._cache.get(key)
if entry is None:
return None
value, tags, expires = entry
if time.monotonic() > expires:
self._evict(key, tags)
return None
return value
def invalidate_tags(self, tags: set[str]) -> int:
"""Invalidate all cache entries matching ANY of the given tags."""
keys_to_invalidate: set[str] = set()
for tag in tags:
keys_to_invalidate |= self._tag_to_keys.get(tag, set())
for key in keys_to_invalidate:
entry = self._cache.pop(key, None)
if entry:
_, entry_tags, _ = entry
for t in entry_tags:
self._tag_to_keys[t].discard(key)
return len(keys_to_invalidate)
def _evict(self, key: str, tags: frozenset[str]) -> None:
self._cache.pop(key, None)
for tag in tags:
self._tag_to_keys[tag].discard(key)
Q3: Explain the memory overhead of sets vs alternative data structures for membership testing.¶
Answer
import sys
n = 1_000_000
# Option 1: set — ~40-50 bytes per element total
s = set(range(n))
set_size = sys.getsizeof(s)
# Option 2: frozenset — similar to set
fs = frozenset(range(n))
frozenset_size = sys.getsizeof(fs)
# Option 3: sorted list + bisect — ~8 bytes per element
import array
a = array.array("q", sorted(range(n)))
array_size = sys.getsizeof(a)
# Option 4: bytearray bitmap — 1 byte per possible value
ba = bytearray(n)
for i in range(n):
ba[i] = 1
bitmap_size = sys.getsizeof(ba)
print(f"set: {set_size:>12,} bytes (hash table overhead)")
print(f"frozenset: {frozenset_size:>12,} bytes")
print(f"array: {array_size:>12,} bytes (O(log n) lookup)")
print(f"bitmap: {bitmap_size:>12,} bytes (O(1) lookup, limited range)")
Q4: How does PYTHONHASHSEED protect against hash collision attacks?¶
Answer
Before Python 3.3, string hashes were deterministic. An attacker could craft inputs that all map to the same hash bucket, degrading O(1) to O(n) — a **HashDoS** attack. **PYTHONHASHSEED** seeds string hashing with a random value at interpreter startup: **Impact on sets:** - Iteration order changes between runs - Tests that depend on set ordering break - Fix: `PYTHONHASHSEED=0` for reproducible tests, or use `sorted()` for deterministic output **What is NOT protected:** Integer and float hashes are still deterministic, so integer-based HashDoS is theoretically possible.Q5: How would you efficiently find the symmetric difference of 100 sets?¶
Answer
The symmetric difference of multiple sets (elements appearing in an **odd** number of sets) can be computed efficiently:from functools import reduce
from collections import Counter
from typing import Iterable
# Method 1: Reduce with ^ (simple but creates many intermediate sets)
def symmetric_diff_reduce(sets: list[set]) -> set:
return reduce(set.symmetric_difference, sets)
# Method 2: Count occurrences — keep elements appearing odd number of times
def symmetric_diff_counter(sets: list[set]) -> set:
counts: Counter = Counter()
for s in sets:
counts.update(s)
return {elem for elem, count in counts.items() if count % 2 == 1}
# Method 3: XOR-based bitwise (for integer elements in known range)
def symmetric_diff_bitwise(sets: list[set[int]], max_val: int) -> set[int]:
bits = bytearray(max_val + 1)
for s in sets:
for elem in s:
bits[elem] ^= 1
return {i for i, b in enumerate(bits) if b}
# Benchmark
import timeit
test_sets = [set(range(i, i + 1000)) for i in range(0, 100_000, 1000)]
t1 = timeit.timeit(lambda: symmetric_diff_reduce(test_sets), number=100)
t2 = timeit.timeit(lambda: symmetric_diff_counter(test_sets), number=100)
print(f"Reduce: {t1:.3f}s, Counter: {t2:.3f}s")
Q6: Describe the internal difference between set and frozenset in CPython.¶
Answer
Both use the same `PySetObject` C struct, but differ in: 1. **Type object**: `PySet_Type` vs `PyFrozenSet_Type` 2. **`hash` field**: Sets store `-1` (sentinel for "no hash"). Frozensets compute and cache their hash. 3. **Mutation methods**: `set` exposes `add`, `remove`, `discard`, `pop`, `clear`, `update`. `frozenset` does not. 4. **Empty singleton**: `frozenset()` returns a cached singleton. `set()` always creates a new object. 5. **`tp_hash` slot**: `set` has `tp_hash = PyObject_HashNotImplemented` (unhashable). `frozenset` computes hash from element hashes using XOR with mixing.Scenario-Based Questions (3-5)¶
Scenario 1: Data Deduplication Pipeline¶
Problem: You receive 10 million user records per day from multiple sources. Each record has an email field. Design a deduplication pipeline.
Answer
from typing import Iterator, Dict, Any
def deduplicate_stream(
records: Iterator[Dict[str, Any]],
key_field: str = "email",
) -> Iterator[Dict[str, Any]]:
"""Memory-efficient deduplication for streaming data."""
seen: set[str] = set()
for record in records:
key = record.get(key_field, "").lower().strip()
if key and key not in seen:
seen.add(key)
yield record
# For very large datasets, consider:
# 1. Bloom filter for probabilistic dedup (saves memory)
# 2. Partitioned processing (hash key % N partitions)
# 3. External storage (Redis SET, database DISTINCT)
Scenario 2: Real-Time Access Control¶
Problem: Design a permission system where users can have multiple roles, each with a set of permissions. Check access in < 1 microsecond.
Answer
from functools import lru_cache
class AccessControl:
def __init__(self):
self._role_perms: dict[str, frozenset[str]] = {}
def define_role(self, role: str, perms: set[str]) -> None:
self._role_perms[role] = frozenset(perms)
self._resolve_cached.cache_clear()
@lru_cache(maxsize=1024)
def _resolve_cached(self, roles: frozenset[str]) -> frozenset[str]:
"""Cached permission resolution."""
result: set[str] = set()
for role in roles:
result |= self._role_perms.get(role, frozenset())
return frozenset(result)
def check(self, user_roles: frozenset[str], required: str) -> bool:
resolved = self._resolve_cached(user_roles)
return required in resolved # O(1) lookup
# Frozenset as cache key enables LRU caching of permission resolution
Scenario 3: API Rate Limiter¶
Problem: Implement a rate limiter that tracks active API keys and blocks banned keys.
Answer
import time
import threading
from collections import defaultdict
class RateLimiter:
def __init__(self, max_requests: int, window_seconds: float):
self.max_requests = max_requests
self.window = window_seconds
self._banned: set[str] = set()
self._requests: dict[str, list[float]] = defaultdict(list)
self._lock = threading.Lock()
def ban(self, api_key: str) -> None:
with self._lock:
self._banned.add(api_key)
def unban(self, api_key: str) -> None:
with self._lock:
self._banned.discard(api_key)
def is_allowed(self, api_key: str) -> bool:
with self._lock:
# O(1) ban check
if api_key in self._banned:
return False
now = time.monotonic()
# Clean old requests
reqs = self._requests[api_key]
cutoff = now - self.window
self._requests[api_key] = [t for t in reqs if t > cutoff]
if len(self._requests[api_key]) >= self.max_requests:
return False
self._requests[api_key].append(now)
return True
FAQ¶
Q: Are Python sets ordered?¶
A: No. Unlike dicts (which preserve insertion order since Python 3.7), sets do not guarantee any order. If you need ordered unique elements, use dict.fromkeys().
Q: Can I store None in a set?¶
A: Yes. None is hashable. {None, 1, 2} is valid.
Q: Is set or frozenset faster?¶
A: For individual operations (add, lookup), they are the same speed. frozenset has an advantage when used as a dict key or set element because its hash is cached after first computation.
Q: How much memory does a set use compared to a list?¶
A: A set uses roughly 4-8x more memory per element than a list, due to hash table overhead (empty slots, hash caching). For 1000 integers: list ~8 KB, set ~32 KB.
Q: Can I use a set as a dictionary key?¶
A: No — sets are mutable and unhashable. Use frozenset instead: {frozenset({1, 2}): "value"}.