React to me architectural upgrade - Advanced hybrid retrieval, preprocessing pipeline, and safety system #97

heliamoh · 2025-09-28T20:34:27Z

This PR introduces a comprehensive architectural overhaul of the Reactome chatbot's retrieval and preprocessing systems. The changes replace the existing simple ensemble retriever with a hybrid system and implement a multi-step preprocessing pipeline with parallel execution capabilities.

1. Hybrid Retrieval System Overhaul

1.1 Core Architecture Changes

Replaced SelfQueryRetriever with custom HybridRetriever class
Added query expansion with parallel multi-source search across all Reactome data subdirectories

1.2 Retrieval Workflow Implementation

The new hybrid system operates on the following pipeline:

Parallel Search Execution: Vector + BM25 search across 5 expanded queries per data subdirectory
Reciprocal Rank Fusion (RRF): Merges the result lists from the independent searches into a single ranking, favoring documents that appear consistently near the top of each list
Asynchronous Processing: All retrieval operations use asyncio.to_thread() for non-blocking execution
Parallel Subdirectory Processing: Multiple data sources searched simultaneously
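
The retrieval steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's HybridRetriever: `vector_search` and `bm25_search` are hypothetical stand-ins that return fixed document IDs, and the constant k = 60 is a commonly used RRF default assumed here rather than taken from the PR.

```python
import asyncio
from collections import defaultdict


def vector_search(query: str) -> list[str]:
    # Hypothetical stand-in for a blocking vector-store lookup.
    return ["doc_a", "doc_b", "doc_c"]


def bm25_search(query: str) -> list[str]:
    # Hypothetical stand-in for a blocking BM25 lookup.
    return ["doc_b", "doc_d", "doc_a"]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Score each document as the sum of 1 / (k + rank) over all lists (rank starts at 1)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


async def hybrid_search(queries: list[str]) -> list[tuple[str, float]]:
    # Run every (query, retriever) pair in a worker thread so the event loop stays free.
    tasks = [
        asyncio.to_thread(search, query)
        for query in queries
        for search in (vector_search, bm25_search)
    ]
    rankings = await asyncio.gather(*tasks)
    return reciprocal_rank_fusion(list(rankings))


fused = asyncio.run(hybrid_search(["original query", "expanded query"]))
```

A document that ranks highly in several lists accumulates more score than one that tops a single list, which is the consistency effect described above.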

2. Preprocessing Pipeline Implementation

2.1 Multi-Step Workflow Architecture

The new preprocessing system implements a fan-out pattern with the following structure:

Sequential Step 1:

Query rephrasing with conversation history integration

Parallel Step 2:

Safety assessment and reasoning
Query expansion for enhanced retrieval
Language detection for multilingual support
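
The two-stage structure above could look roughly like this with `asyncio.gather`. The four step functions are hypothetical placeholders for the LLM-backed runnables, and the returned values (such as the "safe" label) are assumptions made for illustration; only the state keys mirror those shown in the diff to src/agent/profiles/base.py.

```python
import asyncio


async def rephrase(query: str, history: list[str]) -> str:
    # Placeholder: the real step would rewrite the query using conversation history.
    return query.strip()


async def check_safety(query: str) -> tuple[str, str]:
    # Placeholder: returns (safety label, reason if unsafe).
    return "safe", ""


async def expand_query(query: str) -> list[str]:
    # Placeholder: the real step would ask an LLM for alternative phrasings.
    return [query]


async def detect_language(query: str) -> str:
    # Placeholder: always reports English.
    return "English"


async def preprocess(query: str, history: list[str]) -> dict[str, object]:
    # Step 1 (sequential): every later step depends on the rephrased query.
    rephrased = await rephrase(query, history)
    # Step 2 (fan-out): the three checks are independent, so run them concurrently.
    (safety, reason), expanded, language = await asyncio.gather(
        check_safety(rephrased),
        expand_query(rephrased),
        detect_language(rephrased),
    )
    return {
        "rephrased_input": rephrased,
        "safety": safety,
        "reason_unsafe": reason,
        "expanded_queries": expanded,
        "detected_language": language,
    }


result = asyncio.run(preprocess("  What is apoptosis?  ", []))
```

Because `gather` preserves argument order, the results can be unpacked positionally regardless of which step finishes first.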

3. ReactToMe Profile Enhancements

3.1 Conditional Workflow Routing

The ReactToMe profile now implements intelligent routing based on safety assessment:

Questions are only answered if they are ethical, appropriate, and within Reactome scope
Unsafe/out-of-scope questions bypass the main Q&A workflow (no RAG + no external search)
Contextual refusal responses generated for inappropriate queries
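
A minimal sketch of such a routing decision, assuming the safety step writes a string label into the state. The label value "safe" and the branch names "answer" and "refuse" are hypothetical; the PR's actual values and node names may differ.

```python
from typing import Literal, TypedDict


class SafetyState(TypedDict):
    safety: str
    reason_unsafe: str


def route(state: SafetyState) -> Literal["answer", "refuse"]:
    """Pick the next branch: full Q&A workflow, or a refusal that skips retrieval."""
    # Anything other than an explicit "safe" label is refused, so an unexpected
    # or malformed label never reaches RAG or external search.
    return "answer" if state["safety"] == "safe" else "refuse"
```

Defaulting to the refusal branch keeps the failure mode conservative if the safety step ever returns a label the router does not recognise.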

- Add type annotation for rrf_scores in retrieval_utils.py
- Fix metadata dictionary comprehension in csv_chroma.py
- Update retriever type annotations to use Any
- Add isinstance check for BM25Retriever
- Remove default values from TypedDict in base.py
- Fix TypedDict expansion in postprocess method

GFJHogue

Due to the wide set of modules significantly modified in this PR, it might be worth considering:

incorporating these changes into a new React-to-Me Beta profile initially (instead of heavily modifying the Base profile), and/or
splitting up this PR into smaller, focused parts.

As a new profile, we could gradually roll this out (i.e., test on Release) and easily revert to the original profile if need be, instead of having to swap out the entire Docker image.

As for my review here, I'm starting with more surface-level things to address (see the attached comments).
Once those are all sorted out, I can delve deeper into reviewing the code logic.

GFJHogue · 2025-09-30T16:13:59Z

src/agent/profiles/react_to_me.py

+    ) -> ReactToMeState:
+        """Run preprocessing workflow."""
+        result = await super().preprocess(state, config)
+        return ReactToMeState(**result)


No need to define this preprocess() if it's just going to run the one from the superclass (BaseGraphBuilder)

GFJHogue · 2025-09-30T16:21:06Z

src/agent/profiles/react_to_me.py

+        self.unsafe_answer_generator = create_unsafe_answer_generator(streaming_llm)
        self.reactome_rag: Runnable = create_reactome_rag(
-            llm, embedding, streaming=True
+            streaming_llm, embedding, streaming=True


Why create streaming_llm like this when we already have this?:

reactome_chatbot/src/retrievers/reactome/rag.py

Line 31 in e398a37

llm = llm.model_copy(update={"streaming": True})

GFJHogue · 2025-09-30T16:23:41Z

src/agent/tasks/query_expansion.py

@@ -0,0 +1,60 @@
+import json
+from typing import List


importing List type is deprecated. Current Python just uses list directly.
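
For reference, built-in generics have been usable in annotations since Python 3.9 (PEP 585), so the typing import can simply be dropped. A small sketch with a hypothetical helper (not from the PR):

```python
def count_terms(queries: list[str]) -> dict[str, int]:
    """Count how often each whitespace-separated term occurs across the queries."""
    counts: dict[str, int] = {}
    for query in queries:
        for term in query.split():
            counts[term] = counts.get(term, 0) + 1
    return counts
```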

GFJHogue · 2025-09-30T16:26:57Z

src/agent/tasks/query_expansion.py

+    try:
+        return json.loads(output)
+    except json.JSONDecodeError:
+        raise ValueError("LLM output was not valid JSON. Output:\n" + output)


Will an LLM emitting invalid JSON crash the chatbot here?
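
If it would, one option is to degrade gracefully instead of raising. The sketch below falls back to the original query whenever the output cannot be used; the function name and the fallback policy are assumptions, not code from the PR.

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_expanded_queries(output: str, original_query: str) -> list[str]:
    """Return the LLM's expanded queries, or just the original query on bad output."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        logger.warning("Query expansion returned invalid JSON; using the original query")
        return [original_query]
    # Valid JSON can still have the wrong shape, so require a non-empty list of strings.
    if isinstance(parsed, list) and parsed and all(isinstance(q, str) for q in parsed):
        return parsed
    logger.warning("Query expansion returned an unexpected structure; using the original query")
    return [original_query]
```

Retrieval then still runs on at least one query, at the cost of losing the expansion for that request.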

GFJHogue · 2025-09-30T16:32:50Z

src/agent/profiles/base.py

Please exclude changes to code formatting & comments to unmodified existing code from the diff.

GFJHogue · 2025-09-30T16:55:14Z

src/retrievers/csv_chroma.py

+
+        try:
+            documents = create_documents_from_csv(csv_path)
+            retriever = BM25Retriever.from_documents(documents)


BM25Retriever missing preprocess_func
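
As a rough illustration of what such a function might do: without a custom `preprocess_func`, BM25 matching depends on the retriever's default tokenization, so case and punctuation can prevent otherwise obvious matches. The tokenizer below is a hypothetical example, and the commented call assumes `from_documents` accepts a `preprocess_func` keyword, which should be checked against the installed LangChain version.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Assumed usage, not verified against the PR's LangChain version:
# retriever = BM25Retriever.from_documents(documents, preprocess_func=tokenize)
```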

GFJHogue · 2025-09-30T16:58:41Z

src/retrievers/csv_chroma.py

+        """Main retrieval method supporting RRF and parallel processing."""
+        original_query = inputs.get("input", "").strip()
+        if not original_query:
+            raise ValueError("Input query cannot be empty")


could this edge-case crash the chatbot?

GFJHogue · 2025-09-30T17:00:19Z

src/retrievers/rag_chain.py

-from langchain.chains.combine_documents import create_stuff_documents_chain
-from langchain.chains.retrieval import create_retrieval_chain
+from pathlib import Path
+from typing import Any, Dict


Dict type is deprecated, use dict

GFJHogue · 2025-09-30T17:05:03Z

src/retrievers/retrieval_utils.py

LangChain's EnsembleRetriever which we're using already implements RRF:

https://python.langchain.com/docs/how_to/ensemble_retriever/

https://python.langchain.com/api_reference/langchain/retrievers/langchain.retrievers.ensemble.EnsembleRetriever.html

GFJHogue · 2025-09-30T17:20:53Z

src/agent/profiles/base.py

+    safety: str
+    reason_unsafe: str
+    expanded_queries: list[str]
+    detected_language: str


remove reason_unsafe, expanded_queries, & detected_language as they are never used outside of the preprocess step

heliamoh added 13 commits September 27, 2025 17:51

feat: expnad preprocessing to a multi-step workflow.

2607236

- Implement parallel execution of safety and scope check, query expansion, and language detection

feat: Add new runnables for checking question safety and scope, query…

8b35029

… expansion and conversation history management

feat:improved hybrid retrieval

8b2578f

- Replace SelfQueryRetriever with efficient hybrid search (BM25 + vector)
- Add RRF (Reciprocal Rank Fusion) support for query expansion
- Implement parallel processing for improved performance

feat: Add new runnables for checking question safety and scope, query…

b2cc4bb

… expansion and conversation history management

code quality check fixes

3b9e95d

remove: Remove reactome_kg directory from repository

f35f3e0

code quality fixes

ba01931

feat: expnad preprocessing to a multi-step workflow.

7f8d4c5

- Implement parallel execution of safety and scope check, query expansion, and language detection

feat: expnad preprocessing to a multi-step workflow.

67fcd60

- Implement parallel execution of safety and scope check, query expansion, and language detection

feat:improved hybrid retrieval

3ea2ba8

- Replace SelfQueryRetriever with efficient hybrid search (BM25 + vector)
- Add RRF (Reciprocal Rank Fusion) support for query expansion
- Implement parallel processing for improved performance

feat:improved answer generation, in-line citation handling and halluc…

5b82199

…ination mitigation

remove irrelevant docs

27e761c

heliamoh requested a review from GFJHogue September 28, 2025 20:34

heliamoh self-assigned this Sep 28, 2025

GFJHogue requested changes Sep 30, 2025
