diff --git a/03-standalone-api/03-rerank/README.md b/03-standalone-api/03-rerank/README.md new file mode 100644 index 0000000..e9a06f2 --- /dev/null +++ b/03-standalone-api/03-rerank/README.md @@ -0,0 +1,69 @@ +# Contextual AI Reranker Examples + +This folder contains examples demonstrating how to use Contextual AI's reranker, the first reranker with instruction-following capabilities for resolving conflicting information in retrieval. It is also the most accurate reranker available, as measured on industry-standard benchmarks such as BEIR. + +## 📁 Contents + +### 1. `rerank.ipynb` - Basic Reranker Usage +A comprehensive tutorial showing different ways to use the Contextual AI reranker: + +- **REST API implementation** - Direct API calls using the `requests` library +- **Python SDK** - Using the official `contextual-client` package +- **Langchain integration** - Using the `langchain-contextual` package + +**Key Features Demonstrated:** +- Query reranking with custom instructions +- Document metadata handling +- Multiple integration methods +- Enterprise pricing example use case + +### 2. `reranker_benchmarking.ipynb` - Performance Evaluation +A robust evaluation framework for testing the Contextual AI reranker against standard benchmarks: + +- **Dataset Support** - Evaluation on Hugging Face datasets including: +  - touche2020 +  - msmarco +  - treccovid +  - nq (Natural Questions) +  - hotpotqa +  - fiqa2018 + +- **Comprehensive Metrics** - Proper evaluation using: +  - NDCG@10 (Normalized Discounted Cumulative Gain) +  - MAP (Mean Average Precision) +  - Recall@10 +  - MRR (Mean Reciprocal Rank) + +## 🎯 Available Models + +The current reranker models include: +- `ctxl-rerank-v2-instruct-multilingual` - Full model with multilingual support +- `ctxl-rerank-v2-instruct-multilingual-mini` - Faster mini version +- `ctxl-rerank-v1-instruct` - Previous generation model + +## 🔗 Learn More + +- [Contextual AI Reranker Blog Post](https://contextual.ai/blog/introducing-instruction-following-reranker/) +- [Open Sourcing Rerank v2](https://contextual.ai/blog/rerank-v2/) +- [API Documentation](https://docs.contextual.ai/api-reference/rerank/rerank) +- [Python SDK Documentation](https://github.com/ContextualAI/contextual-client-python/blob/main/api.md#rerank) +- [Langchain Package](https://pypi.org/project/langchain-contextual/) + +## 📝 Example Usage + +```python +from contextual import ContextualAI + +client = ContextualAI(api_key="your-api-key") + +rerank_response = client.rerank.create( +    query="What is the enterprise pricing for RTX 5090?", +    instruction="Prioritize internal sales documents over market reports", +    documents=["Document 1", "Document 2", "Document 3"], +    model="ctxl-rerank-v2-instruct-multilingual" +) + +print(rerank_response.to_dict()) +``` + +Start with `rerank.ipynb` for basic usage, then explore `reranker_benchmarking.ipynb` for advanced evaluation and performance testing. diff --git a/03-standalone-api/03-rerank/rerank.ipynb b/03-standalone-api/03-rerank/rerank.ipynb index 20aa3af..d111161 100644 --- a/03-standalone-api/03-rerank/rerank.ipynb +++ b/03-standalone-api/03-rerank/rerank.ipynb @@ -17,6 +17,11 @@ "\n", "This notebook demonstrates how to use the reranker with the Contextual API directly, our Python SDK, and our Langchain package. We'll use the same example throughout.\n", "\n", + "The current reranker models include: \n", + "- ctxl-rerank-v2-instruct-multilingual \n", + "- ctxl-rerank-v2-instruct-multilingual-mini\n", + "- ctxl-rerank-v1-instruct\n", + "\n", + "
\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextualAI/examples/blob/main/03-standalone-api/03-rerank/rerank.ipynb)" @@ -72,7 +77,7 @@ " \"January 25, 2025; NVIDIA Enterprise Sales Portal; Internal Use Only\"\n", "]\n", "\n", - "model = \"ctxl-rerank-en-v1-instruct\"" + "model = \"ctxl-rerank-v2-instruct-multilingual\"" ] }, { diff --git a/03-standalone-api/03-rerank/reranker_benchmarking.ipynb b/03-standalone-api/03-rerank/reranker_benchmarking.ipynb new file mode 100644 index 0000000..fc25ea4 --- /dev/null +++ b/03-standalone-api/03-rerank/reranker_benchmarking.ipynb @@ -0,0 +1,510 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "FNsUqYk1fyxc" + }, + "source": [ + "# Contextual AI Reranker Evaluation Notebook\n", + "\n", + "## Overview\n", + "This notebook demonstrates how to evaluate the Contextual AI reranker using datasets from Hugging Face, with proper metrics calculation including NDCG@10, MAP, and Recall.\n", + "\n", + "### Key Features:\n", + "- 🎯 Evaluation on Hugging Face datasets\n", + "- 📊 Comprehensive metrics (NDCG@10, MAP, Recall@10, MRR)\n", + "- ⚡ Fast performance benchmarking\n", + "- 🔧 Robust evaluation framework with pytrec_eval\n", + "\n", + "
\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextualAI/examples/blob/main/03-standalone-api/03-rerank/reranker_benchmarking.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "iOJNP0Sjfyxg" }, "source": [ "## 1. Setup and Installation\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Sk5-RuXNfyxg" }, "outputs": [], "source": [ "%pip install datasets pytrec_eval contextual-client numpy -q" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IGy4V7xPfyxi" }, "outputs": [], "source": [ "import pytrec_eval\n", "import numpy as np\n", "from typing import List\n", "from datasets import load_dataset\n", "from contextual import ContextualAI\n", "import time\n", "import os" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "flwsJ2Yffyxj" }, "outputs": [], "source": [ "# Set your API keys here\n", "\n", "# Get Hugging Face token\n", "HF_TOKEN = os.getenv(\"hf_key\")\n", "\n", "# Get Contextual AI API key\n", "CONTEXTUAL_API_KEY = os.getenv(\"CONTEXTUAL_API_KEY\")\n", "\n", "# Initialize Contextual AI client\n", "from contextual import ContextualAI\n", "client = ContextualAI(api_key=CONTEXTUAL_API_KEY)" ] }, { "cell_type": "markdown", "metadata": { "id": "DpvewO1vfyxj" }, "source": [ "## 2. Select and Load Dataset\n", "\n", "The following datasets, adapted for reranking evaluation, are available on Hugging Face:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xLyV-kxtfyxk" }, "outputs": [], "source": [ "# Available datasets for evaluation\n", "AVAILABLE_DATASETS = {\n", "    \"touche2020\": \"ContextualAI/touche2020\",\n", "    \"msmarco\": \"ContextualAI/msmarco\",\n", "    \"treccovid\": \"ContextualAI/treccovid\",\n", "    \"nq\": \"ContextualAI/nq\",\n", "    \"hotpotqa\": \"ContextualAI/hotpotqa\",\n", "    \"fiqa2018\": \"ContextualAI/fiqa2018\"\n", "}\n", "\n", "# Select which dataset to use\n", "DATASET_NAME = \"touche2020\" # Change this to use a different dataset\n", "\n", "print(f\"Selected dataset: {AVAILABLE_DATASETS[DATASET_NAME]}\")\n", "\n", "# Load the dataset\n", "dataset = load_dataset(AVAILABLE_DATASETS[DATASET_NAME], token=HF_TOKEN)\n", "print(f\"✅ Loaded {len(dataset['test'])} test examples\")\n", "\n", "# Show example\n", "example = dataset['test'][0]\n", "print(f\"\\nExample query: {example['query'][:100]}...\")\n", "print(f\"Number of candidates: {len(example['candidate_docs'])}\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Pl3LfGyjfyxk" }, "source": [ "## 3. 
Define Evaluation Framework\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "q561PqPyfyxl" }, "outputs": [], "source": [ "def evaluate_reranker_robust(dataset, reranker_func, eval_strings=None):\n", "    \"\"\"\n", "    Robust evaluation function that handles different pytrec_eval metric naming conventions\n", "    \"\"\"\n", "    if eval_strings is None:\n", "        eval_strings = {\"ndcg_cut.10\", \"map\", \"recall_10\"}\n", "\n", "    qrels, results = {}, {}\n", "\n", "    for sample in dataset:\n", "        qid = str(sample[\"_id\"])\n", "        query = sample[\"query\"]\n", "        candidate_docs = sample[\"candidate_docs\"]\n", "        candidate_ids = sample[\"candidate_ids\"]\n", "        gt_ids = sample[\"gt_ids\"]\n", "        gt_qrels = sample[\"gt_qrels\"]\n", "\n", "        # Get scores from reranker\n", "        candidate_scores = reranker_func(query, candidate_docs, candidate_ids)\n", "\n", "        # Prepare qrels (ground truth relevance judgments)\n", "        qrels[qid] = {str(t_id): int(_qrel) for t_id, _qrel in zip(gt_ids, gt_qrels)}\n", "\n", "        # Prepare results (candidate scores)\n", "        results[qid] = {str(cid): float(score) for cid, score in zip(candidate_ids, candidate_scores)}\n", "\n", "    # Ensure non-empty qrels for pytrec_eval\n", "    for qid in list(qrels.keys()):\n", "        if len(qrels[qid]) == 0:\n", "            qrels[qid] = {\"dummy_id_for_pytrec_eval\": 1}\n", "\n", "    # Try to evaluate with the requested metrics\n", "    try:\n", "        evaluator = pytrec_eval.RelevanceEvaluator(qrels, eval_strings)\n", "        scores = evaluator.evaluate(results)\n", "\n", "        # Get the actual metric names from the first result\n", "        if scores:\n", "            first_score = list(scores.values())[0]\n", "            actual_metrics = list(first_score.keys())\n", "            print(f\"Successfully computed metrics: {actual_metrics}\")\n", "\n", "            # Calculate average metrics using the actual metric names\n", "            avg_scores = {}\n", "            for metric in actual_metrics:\n", "                values = [v[metric] for v in scores.values()]\n", "                avg_scores[f\"avg_{metric}\"] = np.mean(values) if values else 0.0\n", "\n", "            return avg_scores\n", "        else:\n", "            print(\"No scores returned from pytrec_eval\")\n", "            return {}\n", "\n", "    except Exception as e:\n", "        print(f\"Error with pytrec_eval: {e}\")\n", "        print(\"Could not compute metrics; returning empty results\")\n", "        return {}\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "c5ffLAODfyxl" }, "source": [ "## 4. 
Contextual AI Reranker Function\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CpyDWMBSfyxl" }, "outputs": [], "source": [ "def contextual_ai_reranker(query: str, candidate_docs: List[str], candidate_ids: List[str]) -> List[float]:\n", "    \"\"\"\n", "    Contextual AI reranker implementation\n", "\n", "    Args:\n", "        query: The search query\n", "        candidate_docs: List of candidate document texts\n", "        candidate_ids: List of candidate document IDs\n", "\n", "    Returns:\n", "        List of relevance scores for each candidate document\n", "    \"\"\"\n", "    try:\n", "        # Optional: Add instruction for the reranker\n", "        instruction = \"\"\n", "\n", "        # Choose model: full or mini version\n", "        model = \"ctxl-rerank-v2-instruct-multilingual\" # Full model\n", "        # model = \"ctxl-rerank-v2-instruct-multilingual-mini\" # Mini model (faster)\n", "\n", "        # Call the Contextual AI reranker\n", "        rerank_response = client.rerank.create(\n", "            query=query,\n", "            instruction=instruction,\n", "            documents=candidate_docs,\n", "            model=model\n", "        )\n", "\n", "        # Extract scores from the response\n", "        response_dict = rerank_response.to_dict()\n", "\n", "        # Each result reports its score in the 'relevance_score' field (not 'score')\n", "        if 'results' in response_dict:\n", "            # Create mapping from index to score\n", "            index_to_score = {\n", "                result.get('index', 0): result.get('relevance_score', 0.0)\n", "                for result in response_dict['results']\n", "            }\n", "\n", "            # Return scores in original document order\n", "            scores = [index_to_score.get(i, 0.0) for i in range(len(candidate_docs))]\n", "        else:\n", "            # Fallback: if response format is different\n", "            scores = [1.0] * len(candidate_docs)\n", "\n", "        return scores\n", "\n", "    except Exception as e:\n", "        print(f\"Error calling Contextual AI reranker: {e}\")\n", "        # Fallback to uniform scores if API call fails\n", "        return [1.0] * len(candidate_docs)\n", "\n", "print(\"Note: relevance scores are read from the 'relevance_score' field of the rerank response\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "A9GAQNKMfyxm" }, "source": [ "## 5. Define Baseline Reranker (for comparison)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3aw6zcQBfyxm" }, "outputs": [], "source": [ "# Baseline reranker function\n", "def simple_baseline_reranker_with_scores(query: str, candidate_docs: List[str], candidate_ids: List[str]) -> List[float]:\n", "    \"\"\"Simple baseline reranker that returns uniform scores (no reranking)\"\"\"\n", "    return [1.0] * len(candidate_ids)" ] }, { "cell_type": "markdown", "metadata": { "id": "F-5f70eUfyxm" }, "source": [ "## 6. 
Dataset Analysis\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "H1JrfiFCfyxm" }, "outputs": [], "source": [ "def analyze_dataset_speed(dataset):\n", "    \"\"\"Analyze the dataset to understand processing requirements\"\"\"\n", "    print(\"Dataset Analysis for Speed Verification\")\n", "    print(\"=\" * 50)\n", "\n", "    total_examples = len(dataset)\n", "    print(f\"Total examples: {total_examples}\")\n", "\n", "    # Analyze candidate document counts\n", "    candidate_counts = []\n", "    doc_lengths = []\n", "    query_lengths = []\n", "\n", "    for example in dataset:\n", "        num_candidates = len(example.get('candidate_docs', []))\n", "        candidate_counts.append(num_candidates)\n", "\n", "        if 'candidate_docs' in example and example['candidate_docs']:\n", "            doc_lengths.extend([len(doc) for doc in example['candidate_docs']])\n", "\n", "        if 'query' in example:\n", "            query_lengths.append(len(example['query']))\n", "\n", "    print(f\"\\nDataset Statistics:\")\n", "    print(f\"Average candidates per query: {np.mean(candidate_counts):.1f}\")\n", "    print(f\"Min candidates: {min(candidate_counts)}\")\n", "    print(f\"Max candidates: {max(candidate_counts)}\")\n", "    print(f\"Average document length: {np.mean(doc_lengths):.0f} characters\")\n", "    print(f\"Average query length: {np.mean(query_lengths):.0f} characters\")\n", "\n", "# Run analysis\n", "analyze_dataset_speed(dataset['test'])" ] }, { "cell_type": "markdown", "metadata": { "id": "hr4jIRDBfyxm" }, "source": [ "## 7. Run Baseline Evaluation\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7k-74QJVfyxm" }, "outputs": [], "source": [ "# Test baseline reranker\n", "print(\"Testing baseline reranker...\")\n", "baseline_robust_results = evaluate_reranker_robust(dataset['test'], simple_baseline_reranker_with_scores)\n", "\n", "print(\"\\nBaseline Results:\")\n", "for metric, value in baseline_robust_results.items():\n", "    print(f\" {metric}: {value:.4f}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "poTWmV22fyxm" }, "source": [ "## 8. Run Contextual AI Reranker Evaluation\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jGgPALjwfyxm" }, "outputs": [], "source": [ "# Test Contextual AI reranker\n", "print(\"Testing Contextual AI reranker...\")\n", "start_time = time.time()\n", "\n", "contextual_ai_results = evaluate_reranker_robust(dataset['test'], contextual_ai_reranker)\n", "\n", "elapsed_time = time.time() - start_time\n", "\n", "print(\"\\nContextual AI Results:\")\n", "for metric, value in contextual_ai_results.items():\n", "    print(f\" {metric}: {value:.4f}\")\n", "\n", "print(f\"\\nProcessing time: {elapsed_time:.1f} seconds ({elapsed_time/60:.1f} minutes)\")\n", "print(f\"Per example: {elapsed_time/len(dataset['test']):.2f} seconds\")" ] }, { "cell_type": "markdown", "metadata": { "id": "AcDAWjJhfyxm" }, "source": [ "## 9. 
Results Comparison\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "k2hnhsKUfyxn" + }, + "outputs": [], + "source": [ + "print(\"\\n\" + \"=\"*50)\n", + "print(\"Comparison:\")\n", + "print(\"Baseline Results:\")\n", + "for metric, value in baseline_robust_results.items():\n", + " print(f\" {metric}: {value:.4f}\")\n", + "\n", + "print(\"\\nContextual AI Results:\")\n", + "for metric, value in contextual_ai_results.items():\n", + " print(f\" {metric}: {value:.4f}\")\n", + "\n", + "# Calculate improvement\n", + "print(\"\\nImprovement over baseline:\")\n", + "for metric in baseline_robust_results.keys():\n", + " if metric in contextual_ai_results:\n", + " baseline_val = baseline_robust_results[metric]\n", + " contextual_val = contextual_ai_results[metric]\n", + " improvement = ((contextual_val - baseline_val) / baseline_val) * 100 if baseline_val > 0 else 0\n", + " print(f\" {metric}: {improvement:+.1f}%\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zX3450rKfyxn" + }, + "source": [ + "## 10. Test on Single Example (Debugging)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tfqrWxuafyxn" + }, + "outputs": [], + "source": [ + "# Test on a single example to see how the reranker works\n", + "example = dataset['test'][1]\n", + "\n", + "print(f\"Query: {example['query']}\")\n", + "print(f\"Number of candidates: {len(example['candidate_docs'])}\")\n", + "\n", + "# Get scores from Contextual AI\n", + "scores = contextual_ai_reranker(\n", + " example['query'],\n", + " example['candidate_docs'],\n", + " example['candidate_ids']\n", + ")\n", + "\n", + "# Check if we're getting non-zero scores\n", + "non_zero_scores = [s for s in scores if s != 0.0]\n", + "print(f\"\\nNon-zero scores: {len(non_zero_scores)} out of {len(scores)}\")\n", + "print(f\"Score range: {min(scores):.4f} to {max(scores):.4f}\")\n", + "\n", + "# Show top 5 documents by score\n", + "doc_scores = list(zip(example['candidate_ids'], scores, example['candidate_docs']))\n", + "doc_scores.sort(key=lambda x: x[1], reverse=True)\n", + "\n", + "print(\"\\nTop 5 documents by relevance score:\")\n", + "for i, (doc_id, score, text) in enumerate(doc_scores[:5]):\n", + " print(f\"\\n{i+1}. Score: {score:.4f}\")\n", + " print(f\" ID: {doc_id}\")\n", + " print(f\" Text: {text[:200]}...\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uViPw-_Wfyxn" + }, + "source": [] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}