Diversify Your Search: Improve Retrieval Results with Pyversity
Pyversity is a lightweight Python library that re-ranks retrieval outputs to balance relevance and diversity. This guide shows how strategies such as MMR and MSD reduce redundant results, with practical code examples.
Pyversity is a fast, lightweight Python library built to reduce redundancy in retrieval results by re-ranking items to surface relevant yet diverse content.
Why diversification matters
Traditional relevance-only ranking methods often return top results that are highly similar or near-duplicates. This reduces the usefulness of results by limiting the user's exposure to varied perspectives or options. Diversification balances relevance with novelty so each selected item adds new information compared to the items already chosen. This is important across domains like e-commerce (show different product styles), news (surface different sources and viewpoints), and RAG/LLM pipelines (avoid feeding repetitive passages to language models).
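A tiny illustration of the problem, using made-up snippets and hypothetical relevance scores: sorting by score alone fills the top slots with near-duplicates.

```python
# Hypothetical (snippet, relevance score) pairs from a retriever.
results = [
    ("Labrador: loyal family dog", 0.95),
    ("Labrador: smart family companion", 0.94),   # near-duplicate of the first
    ("Labradors are great with families", 0.93),  # near-duplicate again
    ("Border Collie: smartest herding breed", 0.88),
    ("Poodle: hypoallergenic and clever", 0.85),
]

# Relevance-only ranking: take the three highest scores.
top3 = sorted(results, key=lambda r: r[1], reverse=True)[:3]
print([text for text, _ in top3])  # three near-identical Labrador entries
```

All three winners say essentially the same thing; a diversified re-ranker would trade one of them for the Border Collie or Poodle entry.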
What Pyversity provides
Pyversity offers a unified API for several popular diversification strategies, including Maximal Marginal Relevance (MMR), Max-Sum-Diversification (MSD), Determinantal Point Processes (DPP), and Cover. The library depends only on NumPy, making it lightweight and easy to integrate into existing retrieval pipelines.
Installing dependencies
pip install openai numpy pyversity scikit-learn

Loading OpenAI API Key
import os
from openai import OpenAI
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')
client = OpenAI()

Creating a redundant search result set for testing
Below is a simulated set of semantically similar search results you might get from a vector DB after a semantic query like "Smart and loyal dogs for family." The dataset intentionally contains redundancy (multiple entries about Labradors and Golden Retrievers) to show how diversification reduces repetition.
import numpy as np
search_results = [
"The Golden Retriever is the perfect family companion, known for its loyalty and gentle nature.",
"A Labrador Retriever is highly intelligent, eager to please, and makes an excellent companion for active families.",
"Golden Retrievers are highly intelligent and trainable, making them ideal for first-time owners.",
"The highly loyal Labrador is consistently ranked number one for US family pets due to its stable temperament.",
"Loyalty and patience define the Golden Retriever, one of the top family dogs globally and easily trainable.",
"For a smart, stable, and affectionate family dog, the Labrador is an excellent choice, known for its eagerness to please.",
"German Shepherds are famous for their unwavering loyalty and are highly intelligent working dogs, excelling in obedience.",
"A highly trainable and loyal companion, the German Shepherd excels in family protection roles and service work.",
"The Standard Poodle is an exceptionally smart, athletic, and surprisingly loyal dog that is also hypoallergenic.",
"Poodles are known for their high intelligence, often exceeding other breeds in advanced obedience training.",
"For herding and smarts, the Border Collie is the top choice, recognized as the world's most intelligent dog breed.",
"The Dachshund is a small, playful dog with a distinctive long body, originally bred in Germany for badger hunting.",
"French Bulldogs are small, low-energy city dogs, known for their easy-going temperament and comical bat ears.",
"Siberian Huskies are energetic, friendly, and need significant cold weather exercise due to their running history.",
"The Beagle is a gentle, curious hound known for its excellent sense of smell and a distinctive baying bark.",
"The Great Dane is a very large, gentle giant breed; despite its size, it's known to be a low-energy house dog.",
"The Australian Shepherd (Aussie) is a medium-sized herding dog, prized for its beautiful coat and sharp intellect."
]

Creating embeddings
def get_embeddings(texts):
    """Fetches embeddings from the OpenAI API."""
    print("Fetching embeddings from OpenAI...")
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return np.array([data.embedding for data in response.data])
embeddings = get_embeddings(search_results)
print(f"Embeddings shape: {embeddings.shape}")

Ranking by relevance (cosine similarity)
We compute cosine similarity between the query embedding and candidate embeddings to simulate a relevance-only ranking. This often puts near-duplicates at the top.
from sklearn.metrics.pairwise import cosine_similarity
query_text = "Smart and loyal dogs for family"
query_embedding = get_embeddings([query_text])[0]
scores = cosine_similarity(query_embedding.reshape(1, -1), embeddings)[0]
print("\n--- Initial Relevance-Only Ranking (Top 5) ---")
initial_ranking_indices = np.argsort(scores)[::-1] # Sort descending
for i in initial_ranking_indices[:5]:
    print(f"Score: {scores[i]:.4f} | Result: {search_results[i]}")

As expected, the top results are dominated by similar descriptions of Labradors and Golden Retrievers. They are all relevant but lack variety.
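For intuition, cosine similarity is just the dot product of two vectors divided by the product of their norms. A minimal pure-Python version (equivalent to what scikit-learn computes above) on toy 2-D vectors:

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

Embeddings of near-duplicate sentences point in nearly the same direction, so their cosine similarity is close to 1, which is exactly what diversification strategies penalize.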
Maximal Marginal Relevance (MMR)
MMR selects items that balance relevance and novelty by penalizing similarity to already chosen items. This yields results that remain on-topic but are less redundant.
from pyversity import diversify, Strategy
# MMR: Focuses on novelty against already picked items.
mmr_result = diversify(
embeddings=embeddings,
scores=scores,
k=5,
strategy=Strategy.MMR,
diversity=0.5 # 0.0 is pure relevance, 1.0 is pure diversity
)
print("\n\n--- Diversified Ranking using MMR (Top 5) ---")
for rank, idx in enumerate(mmr_result.indices):
    print(f"Rank {rank+1} (Original Index {idx}): {search_results[idx]}")

MMR typically keeps the most relevant items at the top but introduces varied breeds like Huskies or French Bulldogs later in the list, reducing repetition.
Max-Sum-Diversification (MSD)
MSD focuses on maximizing the overall spread among selected items, aiming for broad coverage across the result set instead of comparing to previously picked items one-by-one.
# MSD: Focuses on strong spread/distance across all candidates.
msd_result = diversify(
embeddings=embeddings,
scores=scores,
k=5,
strategy=Strategy.MSD,
diversity=0.5
)
print("\n\n--- Diversified Ranking using MSD (Top 5) ---")
for rank, idx in enumerate(msd_result.indices):
    print(f"Rank {rank+1} (Original Index {idx}): {search_results[idx]}")

MSD tends to return a set that covers distinct breeds and traits, giving users a broader perspective while maintaining relevance.
Takeaway
Pyversity makes it straightforward to add diversification to retrieval pipelines with minimal dependencies. By choosing strategies like MMR or MSD and tuning the diversity parameter, you can reduce redundancy and improve the utility of search results for users.