Skip to content

Conversation

@heliamoh
Copy link
Collaborator

This PR introduces a comprehensive architectural overhaul of the Reactome chatbot's retrieval and preprocessing systems. The changes replace the existing simple ensemble retriever with a hybrid system and implement a multi-step preprocessing pipeline with parallel execution capabilities.

1. Hybrid Retrieval System Overhaul

1.1 Core Architecture Changes

  • Replaced SelfQueryRetriever with custom HybridRetriever class
  • Added query expansion with parallel multi-source search across all Reactome data subdirectories

1.2 Retrieval Workflow Implementation

The new hybrid system operates on the following pipeline:

  • Parallel Search Execution: Vector + BM25 search across 5 expanded queries per data subdirectory
  • Reciprocal Rank Fusion (RRF): Reranking to combine results into a single, more robust ranking that enhances accuracy by prioritizing documents appearing consistently in top positions across independent searches
  • Asynchronous Processing: All retrieval operations use asyncio.to_thread() for non-blocking execution
  • Parallel Subdirectory Processing: Multiple data sources searched simultaneously

2. Preprocessing Pipeline Implementation

2.1 Multi-Step Workflow Architecture

The new preprocessing system implements a fan-out pattern with the following structure:

Sequential Step 1:

  • Query rephrasing with conversation history integration

Parallel Step 2:

  • Safety assessment and reasoning
  • Query expansion for enhanced retrieval
  • Language detection for multilingual support

3. ReactToMe Profile Enhancements

3.1 Conditional Workflow Routing

The ReactToMe profile now implements intelligent routing based on safety assessment:

  • Questions are only answered if they are ethical, appropriate, and within Reactome scope
  • Unsafe/out-of-scope questions bypass the main Q&A workflow (no RAG + no external search)
  • Contextual refusal responses generated for inappropriate queries

- Implement parallel execution of safety and scope check, query expansion, and language detection
… expansion and conversation history management
- Replace SelfQueryRetriever with efficient hybrid search (BM25 + vector)
- Add RRF (Reciprocal Rank Fusion) support for query expansion
- Implement parallel processing for improved performance
… expansion and conversation history management
- Add type annotation for rrf_scores in retrieval_utils.py
- Fix metadata dictionary comprehension in csv_chroma.py
- Update retriever type annotations to use Any
- Add isinstance check for BM25Retriever
- Remove default values from TypedDict in base.py
- Fix TypedDict expansion in postprocess method
- Implement parallel execution of safety and scope check, query expansion, and language detection
- Implement parallel execution of safety and scope check, query expansion, and language detection
- Replace SelfQueryRetriever with efficient hybrid search (BM25 + vector)
- Add RRF (Reciprocal Rank Fusion) support for query expansion
- Implement parallel processing for improved performance
@heliamoh heliamoh requested a review from GFJHogue September 28, 2025 20:34
@heliamoh heliamoh self-assigned this Sep 28, 2025
Copy link
Collaborator

@GFJHogue GFJHogue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to the wide set of modules significantly modified in this PR, it might be worth considering:

  1. incorporating these changes into a new React-to-Me Beta profile initially (instead of heavily modifying the Base profile), and/or
  2. splitting up this PR into smaller, focused parts.

As a new profile, we could gradually roll this out (ie. test on Release) and easily revert to the original profile, if need be, instead of having to swap out the entire Docker image.


As for my review here, I'm starting with more surface-level things to address (see the attached comments).
Once those are all sorted out, I can delve deeper into reviewing the code logic.

) -> ReactToMeState:
"""Run preprocessing workflow."""
result = await super().preprocess(state, config)
return ReactToMeState(**result)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to define this preprocess() if it's just going to run the one from the superclass (BaseGraphBuilder)

self.unsafe_answer_generator = create_unsafe_answer_generator(streaming_llm)
self.reactome_rag: Runnable = create_reactome_rag(
llm, embedding, streaming=True
streaming_llm, embedding, streaming=True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why create streaming_llm like this when we already have this?:

llm = llm.model_copy(update={"streaming": True})

@@ -0,0 +1,60 @@
import json
from typing import List
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

importing List type is deprecated. Current Python just uses list directly.

try:
return json.loads(output)
except json.JSONDecodeError:
raise ValueError("LLM output was not valid JSON. Output:\n" + output)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will an LLM emitting invalid JSON crash the chatbot here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please exclude changes to code formatting & comments to unmodified existing code from the diff.


try:
documents = create_documents_from_csv(csv_path)
retriever = BM25Retriever.from_documents(documents)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BM25Retriever missing preprocess_func

"""Main retrieval method supporting RRF and parallel processing."""
original_query = inputs.get("input", "").strip()
if not original_query:
raise ValueError("Input query cannot be empty")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this edge-case crash the chatbot?

from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
from pathlib import Path
from typing import Any, Dict
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dict type is deprecated, use dict

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

safety: str
reason_unsafe: str
expanded_queries: list[str]
detected_language: str
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove reason_unsafe, expanded_queries, & detected_language as they are never used outside of the preprocess step

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants