Beta · 36+ Models · Tollama-native

Forecasting Evaluator
Agent

Benchmark forecasting models. Select the right one for production.
Expanding-window CV, AutoML, leaderboard scoring, and reproducible reports.

Forecasting Evaluator Agent is the evaluation layer of the Tollama stack. It benchmarks statistical, ML, neural, and Tollama-powered foundation models with expanding-window cross-validation, structured scoring, and reproducible reporting.

View on GitHub ↗ How It Works →
// What It Evaluates
36+ models.
One evaluation pipeline.

Compare statistical baselines, intermittent-demand methods, ML models, neural architectures, and Tollama-served TSFMs under the same benchmarking protocol.

36+ Supported Models
CV Expanding-window
AutoML Best-model selection
HTML/PDF Rich evaluation reports
REST API + Campaign mode
Why another evaluation tool?

Forecasting teams repeat the same evaluation loop on every dataset — ingestion, fold design, model fitting, scoring, and reporting. The work is slow, manual, and difficult to standardize.

01 / NOTEBOOK SPRAWL
Every benchmark becomes a one-off project.
Custom scripts, inconsistent preprocessing, and ad hoc metrics make results hard to compare and harder to trust.
02 / UNFAIR EVALUATION
Models are not tested the same way.
When folds, metrics, or data handling change per experiment, leaderboards stop being defensible.
03 / MODEL SELECTION GUESSWORK
The best model is dataset-dependent.
Dense demand, intermittent demand, multiseries retail, and TSFM-friendly workloads need different model families.
04 / REPORTING FRICTION
Results do not flow downstream.
Teams need outputs they can share, audit, version, and wire into production checks.
// Model Coverage
Four model families.
One consistent benchmark.
📊
Statistical Models

Fast baselines and established forecasting methods for robust comparison across standard workloads. ARIMA, ETS, Theta, CES, MSTL, and more.

🌲
ML Models

Feature-based forecasters for tabular-style time-series problems and operational datasets. LightGBM, XGBoost, and other gradient-boosting frameworks.

🧠
Neural Models

Deep forecasting architectures for higher-capacity sequence modeling and multiseries learning. LSTM, N-BEATS, TFT, PatchTST, TimesNet.

🔮
Foundation Models via tollama

Benchmark Chronos, TimesFM, Moirai, Granite TTM, and other Tollama-served TSFMs in the same loop as classical and neural baselines.

// How It Works
Drop in a CSV.
Get a ranked leaderboard.

Run one command or call the API. The evaluator profiles the data, runs expanding-window cross-validation, scores each model, and exports reproducible artifacts.

STEP 01
Ingest

Auto-detect long and wide CSV layouts, normalize the data, and profile frequency, series length, missingness, and intermittency.
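The distinction between the two layouts the ingester detects: long format has one row per (series, timestamp) observation; wide format has one column per series. A minimal stdlib-only sketch of wide-to-long normalization, with hypothetical column names (`unique_id`, `ds`, `y`) for illustration, not the evaluator's actual schema:

```python
# Long format: one row per (series, timestamp) observation.
# Wide format: one column per series, one row per timestamp.
# Column names here are illustrative, not the tool's documented schema.
import csv
import io

wide_csv = """date,store_a,store_b
2024-01-01,10,7
2024-01-02,12,9
"""

def wide_to_long(text, id_col="date"):
    """Melt a wide CSV into long (series, timestamp, value) rows."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        ts = record.pop(id_col)
        for series, value in record.items():
            rows.append({"unique_id": series, "ds": ts, "y": float(value)})
    return rows

long_rows = wide_to_long(wide_csv)
# Four observations: two dates x two series.
```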

STEP 02
Benchmark

Run expanding-window cross-validation with consistent folds, metrics, and timeouts across selected model families.
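The geometry of expanding-window cross-validation: each fold trains on all history up to a cutoff and tests on the next `horizon` points, so the train set grows with every fold. A sketch of the fold indexing, assuming contiguous non-overlapping test windows; not the evaluator's internal code:

```python
# Expanding-window CV: fold k trains on [0, test_start) and tests on
# [test_start, test_end). The train window grows; test windows tile the
# most recent n_folds * horizon observations.
def expanding_window_folds(n_obs, horizon, n_folds):
    """Yield (train_end, test_start, test_end) index triples, oldest first."""
    folds = []
    for k in range(n_folds, 0, -1):
        test_end = n_obs - (k - 1) * horizon
        test_start = test_end - horizon
        folds.append((test_start, test_start, test_end))
    return folds

# 30 observations, horizon 7, 3 folds:
folds = expanding_window_folds(30, 7, 3)
# [(9, 9, 16), (16, 16, 23), (23, 23, 30)]
```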

STEP 03
Rank

Aggregate accuracy, uncertainty, runtime, and stability into a structured leaderboard.
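The core of a leaderboard is averaging an error metric across folds per model and ranking ascending (lower error wins). A toy sketch with made-up model names and MASE scores, purely for illustration:

```python
# Aggregate per-fold scores into a ranked leaderboard.
# Models and numbers are hypothetical.
from statistics import mean

fold_scores = {  # model -> MASE per fold
    "seasonal_naive": [1.00, 1.05, 0.98],
    "lightgbm":       [0.82, 0.88, 0.85],
    "chronos":        [0.79, 0.91, 0.83],
}

leaderboard = sorted(
    ((model, round(mean(scores), 4)) for model, scores in fold_scores.items()),
    key=lambda row: row[1],  # ascending: lower error ranks first
)
```

A real leaderboard would fold in uncertainty, runtime, and stability alongside accuracy, as described above.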

STEP 04
Recommend

Use AutoML to narrow the candidate set based on data profile and intermittent-demand behavior.
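One standard heuristic for detecting intermittent-demand behavior is the Syntetos-Boylan classification, which splits series by ADI (average inter-demand interval) and CV² (squared coefficient of variation of nonzero demand sizes). A sketch of that heuristic, not necessarily the evaluator's exact rule:

```python
# Syntetos-Boylan demand classification (thresholds ADI 1.32, CV^2 0.49).
from statistics import mean, pstdev

def classify_demand(series):
    nonzero = [x for x in series if x != 0]
    if not nonzero:
        return "no_demand"
    adi = len(series) / len(nonzero)          # avg gap between demands
    cv2 = (pstdev(nonzero) / mean(nonzero)) ** 2  # size variability
    if adi < 1.32 and cv2 < 0.49:
        return "smooth"
    if adi >= 1.32 and cv2 < 0.49:
        return "intermittent"
    if adi < 1.32:
        return "erratic"
    return "lumpy"

# Mostly-zero series with stable sizes -> intermittent demand,
# favoring methods like Croston over dense-demand models.
demand_class = classify_demand([0, 0, 5, 0, 0, 5, 0, 5, 0, 0])
```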

STEP 05
Export

Write machine-readable outputs and visual reports for reviews, audits, and regression tracking.

// Outputs
Structured outputs.
Ready for teams and systems.

Each run produces files that work for both humans and downstream tooling.

📋
results.json

Machine-readable benchmark results with a stable schema for dashboards and pipelines.

🔍
details.json

Forecast traces and diagnostics for reproducibility and postmortem analysis.

📊
report.html

Interactive visual report with leaderboard, per-fold breakdown, and per-series views.

📄
report.pdf

Optional export for review workflows and stakeholder sharing.

// Developer Interfaces
4 ways to access.
One evaluation layer.

Use the interface that fits your workflow — from one-off local benchmarking to remote execution across multiple datasets.

⌨️
CLI

Run reproducible benchmarks from a single command, with model filters, config files, report generation, and diagnostics.

🐍
Python SDK

Embed evaluation workflows in notebooks, services, and internal tooling with a fluent Python interface.

🌐
REST API

Start a benchmark server and trigger remote runs through structured endpoints for benchmark, status, results, and health.

🔄
Campaign Mode

Evaluate a directory of datasets and generate portfolio-level summaries across many benchmark jobs.
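A campaign run is driven by a config file like the `campaigns/q1_eval.yaml` referenced in the CLI example. The key names below are illustrative assumptions, not the tool's documented schema; check the project README for the real format.

```yaml
# Hypothetical campaign config -- keys shown for illustration only.
campaign:
  name: q1_eval
  data_dir: datasets/q1/
  horizon: 7
  cv:
    strategy: expanding
    folds: 5
  models: auto
  reports: [html, pdf]
```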

Python-first evaluation
infrastructure.

Built on the Tollama ecosystem with production-grade Python tooling for reproducible, scalable model evaluation.

  • Python 3.10+ · Core runtime
  • PyPI · pip install tollama-eval
  • StatsForecast · Statistical model families
  • NeuralForecast · Deep learning model families
  • Tollama · TSFM runtime integration
  • FastAPI · REST API for evaluation jobs
  • Jinja2 + WeasyPrint · HTML & PDF reports
bash · tollama-eval CLI
# Install
pip install tollama-eval

# Run evaluation on a dataset
tollama-eval run \
  --data sales.csv \
  --horizon 7 \
  --models auto \
  --cv expanding \
  --folds 5

# Generate report
tollama-eval report \
  --run results/run_001 \
  --format html pdf

# Campaign mode (batch)
tollama-eval campaign \
  --config campaigns/q1_eval.yaml
// Tollama Integration
tollama is the runtime.
This is the evaluation layer.

Use tollama to pull and serve TSFMs through one API. Use Forecasting Evaluator Agent to compare those models against statistical, ML, and neural alternatives — then select the right model for production.

Unified TSFM Access

Pass a tollama server URL and model list to include foundation models in the same benchmark run.

⚖️
Fair Comparison

Evaluate TSFMs beside non-foundation baselines under the same folds, metrics, and report format.

🚀
Production Selection

Turn model comparison into a repeatable workflow with config files, API endpoints, and version-stable outputs.

// Roadmap
What comes next.

The next step is continuous evaluation — moving from one-off benchmarking to persistent model governance.

IN PROGRESS
Continuous Evaluation
Scheduled regression checks, benchmark baselines, and release gating for forecasting systems.
PLANNED
Evaluator Cloud
Hosted runs, team workspaces, and shareable leaderboard views.
PLANNED
Evaluation Narratives
Auto-generated benchmark summaries for engineers, analysts, and decision-makers.
// Get Started

Know which model
actually works.

Production-ready infrastructure for rigorous forecasting model evaluation.

View on GitHub ↗ ← Back to Tollama AI