Support Vector Machines

Foundations of HNSW

Categories of ANN Algorithms:

Approximate nearest neighbors (ANN) algorithms can be broadly categorized into three types:

Trees: Data structures like KD-trees that partition the search space.
Hashes: Techniques like Locality Sensitive Hashing (LSH) that hash similar items to the same bucket.
Graphs: Algorithms that create graphs where vertices represent data points and edges represent proximity relationships.

HNSW falls into the graph category, specifically a proximity graph where vertices are linked based on their proximity to each other, typically using Euclidean distance.

Key Concepts:

Probability Skip List: Introduced by William Pugh in 1990, skip lists use a layered structure to achieve fast search and insertion. Higher layers skip over many nodes, enabling quick searches, while lower layers handle more granular details. HNSW borrows this layered approach to enhance search efficiency. In the HNSW graph, each vertex is assigned a level based on a probabilistic function. The probability 𝑃(𝑙) of a vertex having level 𝑙 follows:

𝑃(𝑙) = 1/2^𝑙

where 𝑙 is the level, and higher levels have exponentially fewer vertices.
Navigable Small World Graphs: NSW graphs combine short-range and long-range links to reduce search complexity to logarithmic levels. Each vertex connects to several others, forming a "friend list". Searching involves a greedy approach where you move to the nearest neighboring vertex until no closer neighbors are found, marking the end of the search.

Creating HNSW

HNSW builds on the principles of NSW by adding a hierarchical structure. Here’s how it works:

Hierarchical Structure: HNSW introduces multiple layers. The top layer contains vertices with long-range links, while lower layers have shorter and denser links. This hierarchy helps in efficiently traversing the graph during searches.
Search Process:

Top Layer: Begin the search at the top layer, which includes long-range connections. This layer often contains high-degree vertices, facilitating a quick initial search.
Lower Layers: After finding a local minimum in the top layer, descend to lower layers and repeat the search process. This ensures a more refined search and helps in reaching the final local minimum in the bottom layer.

Graph Construction: During graph construction, vectors are inserted one-by-one into the HNSW structure. The insertion layer is determined by a probability function, and each vector is added to its insertion layer and all layers below it. Parameters such as M (number of neighbors), M_max (maximum number of links), and M_max0 (maximum links in layer 0) influence the structure and performance of the graph.

Performance and Tuning

Recall vs. Search Time:

Recall: Increasing parameters like M and efSearch improves recall but can lead to higher search times. Optimal settings balance recall and search speed.
Search Time: Higher values of M and efSearch can drastically affect search times. For example, recall can vary from 80% to 100% depending on the settings, but search times can range from 0.1ms to 50ms.

Memory Usage:

HNSW is memory-intensive. The parameter M directly impacts memory usage, with higher values leading to significantly larger index sizes. Memory usage remains unaffected by efConstruction and efSearch.

Improving Memory Usage:

To reduce memory consumption, consider compressing vectors using product quantization (PQ). While PQ reduces recall and increases search time, it helps in managing memory better.

Boosting Search Speeds:

Adding an Inverted File (IVF) component to your HNSW index can enhance search speeds. IVF helps in efficiently locating the relevant portions of the index, speeding up the search process.

!pip install faiss # Install the faiss library

import faiss # Import the faiss library

d = 128  # Dimension of vectors
M = 32    # Number of neighbors in each layer
index = faiss.IndexHNSWFlat(d, M)

xb = ...  # Your dataset of vectors (numpy array)
index.add(xb)

efConstruction = 200  # Number of neighbors considered during index construction
efSearch = 50         # Number of neighbors considered during search
index.hnsw.efConstruction = efConstruction
index.hnsw.efSearch = efSearch

xq = ...  # Your query set (numpy array)
k = 5     # Number of nearest neighbors to retrieve
distances, indices = index.search(xq[:1000], k)

Initialization (init method): The initialization phase sets up the fundamental parameters for the HNSW graph:

M: Maximum number of neighbors each data point can have, which controls the density of the graph.
ef_construction: Parameter influencing the number of neighbors considered during graph construction, affecting the construction time and graph quality.
ef_search: Parameter affecting the search time and quality by determining how many neighbors are considered during the search.

Adding Data Points (add_data_point method): Data points are added to the graph. Each data point is stored in a list and its corresponding entry in the graph is initialized as an empty list. This method prepares the graph structure to accommodate new data points and their neighbors.
Constructing the Graph (construct_graph method): The graph construction involves iterating over all data points and adding their neighbors using the _add_neighbors method. This method creates connections between data points based on proximity, using the ef_construction parameter to control the number of neighbors.
Adding Neighbors (_add_neighbors method): This method is responsible for adding neighbors to a specific data point: It calculates the neighbors based on proximity and the ef parameter. It updates the graph with these neighbors.
Getting Neighbors (_get_neighbors method): The _get_neighbors method finds and sorts the closest neighbors of a given data point: It computes the distance between the target data point and all other data points. Neighbors are sorted based on distance, and the top ef closest points are selected to form the neighbor list.
Searching (search method): This method performs a search operation: It first ensures that the query point is present in the graph by adding it if necessary. It then searches for the k nearest neighbors using the _search method. The ef_search parameter controls the number of neighbors considered during the search process.
Search Process (_search method): The _search method is responsible for finding the nearest neighbors of a query point: It performs a greedy search by iterating over the neighbors and updating the candidate list. The search is conducted in a depth-first manner, expanding to neighbors and recursively searching through them. The ef parameter influences the number of neighbors considered at each step, affecting the search quality and efficiency.
Getting Neighbors (get_neighbors method): This method retrieves the list of neighbors for a given data point from the graph. It is used to access the connections established during the graph construction.

import numpy as np
import matplotlib.pyplot as plt
import networkx as nx

class HNSWGraph:
    def __init__(self, M, ef_construction, ef_search):
        self.M = M  # Maximum number of neighbors
        self.ef_construction = ef_construction  # Construction time parameter
        self.ef_search = ef_search  # Search time parameter
        self.data = []  # Data points
        self.graph = {}  # Graph structure

    def add_data_point(self, data_point):
        self.data.append(data_point)
        self.graph[tuple(data_point.flatten())] = []

    def construct_graph(self):
        for i, data_point in enumerate(self.data):
            self._add_neighbors(tuple(data_point.flatten()), self.ef_construction)

    def _add_neighbors(self, data_point, ef):
        neighbors = self._get_neighbors(data_point, ef)
        self.graph[data_point] = neighbors

    def _get_neighbors(self, data_point, ef):
        distances = [(np.linalg.norm(np.array(data_point) - other), tuple(other.flatten())) for other in self.data if not np.array_equal(np.array(data_point), other)]
        distances.sort()
        return [x[1] for x in distances[:ef]]

    def search(self, query, k):
        query_tuple = tuple(query.flatten())
        if query_tuple not in self.graph:
            self.add_data_point(query)
            self._add_neighbors(query_tuple, self.ef_construction)

        visited = set()
        candidates = []
        self._search(query_tuple, k, visited, candidates, self.ef_search)

        # Ensure only k neighbors are returned
        unique_candidates = list(set(candidates))
        return sorted(unique_candidates, key=lambda x: np.linalg.norm(np.array(query) - np.array(x)))[:k]

    def _search(self, query, k, visited, candidates, ef):
        if len(candidates) >= k:
            return
        for neighbor in self.graph[query]:
            if neighbor not in visited:
                visited.add(neighbor)
                distances = [(np.linalg.norm(np.array(query) - np.array(other)), tuple(other.flatten())) for other in self.data if not np.array_equal(np.array(query), other)]
                distances.sort()
                new_candidates = [x[1] for x in distances[:ef]]
                candidates.extend(new_candidates)
                self._search(neighbor, k, visited, candidates, ef)

    def get_neighbors(self, data_point):
        return self.graph[data_point]

    def visualize_graph(self):
        G = nx.Graph()
        for node, neighbors in self.graph.items():
            for neighbor in neighbors:
                if node != neighbor:
                    G.add_edge(node, neighbor)

        pos = {node: (node[0], node[1]) for node in G.nodes()}
        plt.figure(figsize=(12, 10))
        nx.draw(G, pos, with_labels=False, node_size=30, node_color='blue', edge_color='gray')
        plt.title("HNSW Graph")
        plt.show()

    def visualize_search(self, query, neighbors):
        G = nx.Graph()
        # Add edges between the query point and its k-nearest neighbors
        query_pos = tuple(query.flatten())
        G.add_node(query_pos)
        for neighbor in neighbors:
            neighbor_pos = tuple(neighbor.flatten()) if isinstance(neighbor, np.ndarray) else neighbor
            G.add_node(neighbor_pos)
            G.add_edge(query_pos, neighbor_pos)

        pos = {node: (node[0], node[1]) for node in G.nodes()}
        plt.figure(figsize=(12, 10))
        nx.draw(G, pos, with_labels=False, node_size=100, node_color='blue', edge_color='gray')

        # Plot query point and nearest neighbors
        plt.scatter([query_pos[0]], [query_pos[1]], color='red', s=500, edgecolor='black', label='Query Point')
        plt.scatter([pos[0] for pos in G.nodes() if pos != query_pos], [pos[1] for pos in G.nodes() if pos != query_pos], color='green', s=500, edgecolor='black', label='Nearest Neighbors')

        plt.legend()
        plt.title("HNSW Search")
        plt.show()


# Create a sample dataset
np.random.seed(0)
data = np.random.rand(100, 2)  # Reduced dimension to 2 for visualization

# Create an HNSW graph
graph = HNSWGraph(M=10, ef_construction=50, ef_search=10)

# Add data points to the graph
for point in data:
    graph.add_data_point(point)

# Construct the graph
graph.construct_graph()

# Search for k-nearest neighbors
query = np.random.rand(1, 2)[0]  # Reduced dimension to 2 for visualization
k = 5
neighbors = graph.search(query, k)

# Visualize the graph
graph.visualize_graph()

# Visualize the search results
graph.visualize_search(query, neighbors)

Hierarchical Navigable Small World(HNSW) Graphs

Foundations of HNSW

Categories of ANN Algorithms:

Key Concepts:

𝑃(𝑙) = 1/2^𝑙

Creating HNSW

Performance and Tuning

Recall vs. Search Time:

Memory Usage:

Improving Memory Usage:

Boosting Search Speeds:

HNSW Using FAISS

HNSW From scratch

Hierarchical Navigable Small World(HNSW) Graphs

Foundations of HNSW

Categories of ANN Algorithms:

Key Concepts:

𝑃(𝑙) = 1/2𝑙

Creating HNSW

Performance and Tuning

Recall vs. Search Time:

Memory Usage:

Improving Memory Usage:

Boosting Search Speeds:

HNSW Using FAISS

HNSW From scratch

𝑃(𝑙) = 1/2^𝑙