Benchmarking framework for comparing RayforceDB against popular DataFrame libraries and databases.
```bash
# Clone the repository
git clone https://github.com/anthropics/rayforce-bench.git
cd rayforce-bench

# Install dependencies
make setup

# Generate benchmark data (1M rows)
make data

# Run benchmarks
make bench
```

Results are generated in `docs/index.html`.
| Adapter | Type | Description |
|---|---|---|
| `rayforce` | Embedded | RayforceDB native execution via timeit |
| `polars` | Embedded | Polars DataFrame (Rust-based) |
| `duckdb` | Embedded | DuckDB embedded SQL |
| `pandas` | Embedded | Pandas DataFrame |
| `questdb` | Server | QuestDB via PostgreSQL protocol |
| `timescale` | Server | TimescaleDB (PostgreSQL) |
Based on the H2O.ai db-benchmark:

GroupBy:
- Q1: `sum(v1) group by id1`
- Q2: `sum(v1) group by id1, id2`
- Q3: `sum(v1), mean(v3) group by id3`
- Q4: `mean(v1), mean(v2), mean(v3) group by id3`
- Q5: `sum(v1), sum(v2), sum(v3) group by id3`

Join:
- Inner Join: Join on `id1`
- Left Join: Join on `id1`

Sort:
- Single Column: Sort by `id1`
- Multi Column: Sort by `id1, id2, id3`
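For concreteness, here is how the five GroupBy queries map onto pandas operations. This is an illustrative sketch, not the harness's actual pandas adapter; the dataset path is a placeholder.

```python
import pandas as pd

# Placeholder path; point this at a generated groupby dataset.
df = pd.read_parquet("data/groupby_1m_k100")

q1 = df.groupby("id1", as_index=False)["v1"].sum()
q2 = df.groupby(["id1", "id2"], as_index=False)["v1"].sum()
q3 = df.groupby("id3", as_index=False).agg(v1_sum=("v1", "sum"), v3_mean=("v3", "mean"))
q4 = df.groupby("id3", as_index=False).agg(v1_mean=("v1", "mean"),
                                           v2_mean=("v2", "mean"),
                                           v3_mean=("v3", "mean"))
q5 = df.groupby("id3", as_index=False).agg(v1_sum=("v1", "sum"),
                                           v2_sum=("v2", "sum"),
                                           v3_sum=("v3", "sum"))
```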
```bash
# Check dependencies
make check

# Generate data
make data          # 1M rows (default)
make data-small    # 100K rows (quick tests)
make data-large    # 10M rows (production benchmarks)

# Run benchmarks
make bench         # Default adapters (pandas, polars, duckdb, rayforce)
make bench-all     # All adapters (requires Docker for QuestDB/TimescaleDB)
```

```bash
# GroupBy only
python -m bench.runner groupby -d data/groupby_1m_k100 -a pandas polars duckdb rayforce

# Join only
python -m bench.runner join -d data/join_1m_100k -a pandas polars duckdb rayforce

# Sort only
python -m bench.runner sort -d data/sort_1m_k100 -a pandas polars duckdb rayforce

# All suites
python -m bench.runner all -d data/groupby_1m_k100 -a pandas polars duckdb rayforce
```

```
python -m bench.runner <benchmark> [options]

Arguments:
  benchmark               groupby, join, sort, or all

Options:
  -d, --data PATH         Path to dataset directory (required)
  -a, --adapters LIST     Adapters to benchmark (default: pandas polars duckdb rayforce)
  -i, --iterations N      Number of measured iterations (default: 5)
  -w, --warmup N          Number of warmup iterations (default: 2)
  --rayforce-local PATH   Path to local rayforce-py repo for dev builds
  --html PATH             Output HTML report path (default: docs/index.html)
  --no-html               Skip HTML report generation
  --no-docker             Don't auto-start Docker containers
  --stop-infra            Stop Docker containers after benchmarks
  --check-deps            Check dependencies and exit
```
To benchmark a development build of rayforce-py:
```bash
# Method 1: Using make
make bench-local RAYFORCE_LOCAL=~/rayforce-py

# Method 2: Direct command
python -m bench.runner groupby \
    -d data/groupby_1m_k100 \
    -a pandas polars duckdb rayforce \
    --rayforce-local ~/rayforce-py

# Method 3: Install locally first, then benchmark
cd ~/rayforce-py
pip install -e .
cd ~/rayforce-bench
make bench
```

The `--rayforce-local` option will:
- Build rayforce-py from the specified path
- Use the local build for benchmarks
- Show the version as `X.Y.Z (local: /path/to/rayforce-py)`
QuestDB and TimescaleDB require Docker:
```bash
# Start containers
make infra-start

# Run benchmarks with all adapters
make bench-all

# Stop containers
make infra-stop

# Check container status
make infra-status

# Remove containers completely
make infra-cleanup
```

Container configuration:
- QuestDB: Port 8812 (PostgreSQL wire protocol)
- TimescaleDB: Port 5433 (to avoid conflict with a local PostgreSQL)
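As a quick sanity check that the containers are reachable, you can open PostgreSQL-protocol connections on those ports. A minimal sketch with psycopg2 is below; the QuestDB credentials are its stock defaults (`admin`/`quest`, database `qdb`), and the TimescaleDB user, password, and database are placeholders that must match however the containers are configured.

```python
import psycopg2

# QuestDB speaks the PostgreSQL wire protocol on port 8812.
# admin/quest and database "qdb" are QuestDB's defaults; adjust if needed.
questdb = psycopg2.connect(
    host="localhost", port=8812,
    user="admin", password="quest", dbname="qdb",
)

# TimescaleDB is exposed on 5433 here to avoid clashing with a local
# PostgreSQL on 5432. Credentials below are placeholders.
timescale = psycopg2.connect(
    host="localhost", port=5433,
    user="postgres", password="postgres", dbname="postgres",
)

for name, conn in [("questdb", questdb), ("timescale", timescale)]:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(name, "ok:", cur.fetchone())
    conn.close()
```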
```
rayforce-bench/
├── bench/
│   ├── adapters/              # Database adapters
│   │   ├── base.py            # Abstract Adapter interface
│   │   ├── pandas_adapter.py
│   │   ├── polars_adapter.py
│   │   ├── duckdb_adapter.py
│   │   ├── rayforce_adapter.py
│   │   ├── questdb_adapter.py
│   │   └── timescale_adapter.py
│   ├── generators/            # Data generators
│   │   ├── groupby.py         # H2O-style groupby data
│   │   ├── join.py            # Join benchmark data
│   │   └── sort.py            # Sort benchmark data
│   ├── runner.py              # Benchmark runner CLI
│   ├── report.py              # HTML report generator
│   ├── infra.py               # Docker infrastructure management
│   └── generate.py            # Data generation CLI
├── data/                      # Generated datasets (git-ignored)
├── docs/                      # Generated reports (GitHub Pages)
│   ├── index.html             # Interactive benchmark report
│   └── data.json              # Raw benchmark data
├── Makefile
├── requirements.txt
├── README.md
└── FAIRNESS.md                # Benchmark methodology
```
Datasets are stored as Parquet files with the following schemas.

GroupBy dataset:
| Column | Type | Description |
|---|---|---|
| id1 | int64 | Low cardinality key (K unique values) |
| id2 | int64 | Low cardinality key (K unique values) |
| id3 | int64 | Low cardinality key (K unique values) |
| v1 | float64 | Value column (normal distribution) |
| v2 | float64 | Value column (normal distribution) |
| v3 | float64 | Value column (normal distribution) |

Join dataset:

| Table | Columns | Description |
|---|---|---|
| left | id1, id2, v1 | Left table (larger) |
| right | id1, id3, v2 | Right table (smaller) |
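A rough sketch of how a GroupBy dataset matching this schema could be produced with NumPy and pandas is shown below. It is not the actual `bench/generators/groupby.py` code; the row count, cardinality K, seed, and output path are illustrative.

```python
from pathlib import Path

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows, k = 1_000_000, 100  # illustrative sizes, matching the "1m_k100" dataset name

df = pd.DataFrame({
    # Low-cardinality int64 keys with K unique values each
    "id1": rng.integers(0, k, n_rows),
    "id2": rng.integers(0, k, n_rows),
    "id3": rng.integers(0, k, n_rows),
    # Normally distributed float64 value columns
    "v1": rng.normal(size=n_rows),
    "v2": rng.normal(size=n_rows),
    "v3": rng.normal(size=n_rows),
})

out_dir = Path("data/groupby_1m_k100")  # placeholder path
out_dir.mkdir(parents=True, exist_ok=True)
df.to_parquet(out_dir / "groupby.parquet")  # file name is a placeholder
```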
See FAIRNESS.md for detailed methodology on how we ensure fair comparisons.
Key principles:
- Measure query execution time only (not data loading or result serialization)
- Each adapter uses its native timing mechanism where possible
- Data is pre-loaded into memory before timing starts
- Warmup iterations ensure JIT compilation and cache warming
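The warmup-then-measure pattern looks roughly like the following. This is a simplified sketch of the idea, not the runner's actual timing code; `run_query` stands in for whatever an adapter executes.

```python
import statistics
import time


def benchmark(run_query, warmup: int = 2, iterations: int = 5) -> dict:
    """Time a pre-loaded query: discard warmup runs, then record measured runs."""
    for _ in range(warmup):       # warm caches / trigger JIT; results discarded
        run_query()

    timings = []
    for _ in range(iterations):   # measured iterations only
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)

    return {"min": min(timings), "median": statistics.median(timings)}


# Example: time pandas Q1 on an already-loaded DataFrame `df`
# result = benchmark(lambda: df.groupby("id1", as_index=False)["v1"].sum())
```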
- Fork the repository
- Create a feature branch
- Add your adapter in `bench/adapters/` (a skeleton is sketched below)
- Update this README
- Submit a pull request
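For orientation, a new-adapter skeleton might look like the sketch below. The class name, module path, and method names (`load`, `run`) are assumptions for illustration; the authoritative contract is the abstract interface in `bench/adapters/base.py`.

```python
# bench/adapters/myengine_adapter.py -- hypothetical skeleton; the method names
# and exact base-class contract are assumptions, see bench/adapters/base.py.


class MyEngineAdapter:
    """Sketch of an embedded adapter: pre-load data, then execute queries natively."""

    name = "myengine"

    def load(self, data_dir: str) -> None:
        """Load the Parquet dataset into the engine's memory before timing starts."""
        raise NotImplementedError

    def run(self, query: str):
        """Execute one named benchmark query (e.g. a GroupBy Q1) in the engine."""
        raise NotImplementedError
```

The new adapter also likely needs to be registered so the runner's `-a` flag can select it; check how the existing adapters in `bench/adapters/` are wired into `bench/runner.py`.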
MIT