Beta · 36+ Models · Tollama-native

Forecasting Evaluator
Agent

Benchmark forecasting models. Select the right one for production.
Expanding-window CV, AutoML, leaderboard scoring, and reproducible reports.

Forecasting Evaluator Agent is the evaluation layer of the Tollama stack. It benchmarks statistical, ML, neural, and Tollama-powered foundation models with expanding-window cross-validation, structured scoring, and reproducible reporting.

View on GitHub ↗ How It Works →
// What It Evaluates
36+ models.
One evaluation pipeline.

Compare statistical baselines, intermittent-demand methods, ML models, neural architectures, and Tollama-served TSFMs under the same benchmarking protocol.

36+ Supported Models
CV Expanding-window
AutoML Best-model selection
HTML/PDF Rich evaluation reports
REST API + Campaign mode
Why another evaluation tool?

Forecasting teams repeat the same evaluation loop on every dataset — ingestion, fold design, model fitting, scoring, and reporting. The work is slow, manual, and difficult to standardize.

01 / NOTEBOOK SPRAWL
Every benchmark becomes a one-off project.
Custom scripts, inconsistent preprocessing, and ad hoc metrics make results hard to compare and harder to trust.
02 / UNFAIR EVALUATION
Models are not tested the same way.
When folds, metrics, or data handling change per experiment, leaderboards stop being defensible.
03 / MODEL SELECTION GUESSWORK
The best model is dataset-dependent.
Dense demand, intermittent demand, multiseries retail, and TSFM-friendly workloads need different model families.
04 / REPORTING FRICTION
Results do not flow downstream.
Teams need outputs they can share, audit, version, and wire into production checks.
// Model Coverage
Four model families.
One consistent benchmark.
📊
Statistical Models

Fast baselines and established forecasting methods for robust comparison across standard workloads. ARIMA, ETS, Theta, CES, MSTL, and more.

🌲
ML Models

Feature-based forecasters for tabular-style time-series problems and operational datasets. LightGBM, XGBoost, and other gradient-boosting frameworks.

🧠
Neural Models

Deep forecasting architectures for higher-capacity sequence modeling and multiseries learning. LSTM, N-BEATS, TFT, PatchTST, TimesNet.

🔮
Foundation Models via tollama

Benchmark Chronos, TimesFM, Moirai, Granite TTM, and other Tollama-served TSFMs in the same loop as classical and neural baselines.

// How It Works
Drop in a CSV.
Get a ranked leaderboard.

Run one command or call the API. The evaluator profiles the data, runs expanding-window cross-validation, scores each model, and exports reproducible artifacts.

STEP 01
Ingest

Auto-detect long and wide CSV layouts, normalize the data, and profile frequency, series length, missingness, and intermittency.
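The distinction between the two layouts the ingester detects: long format has one row per (series, timestamp) observation; wide format has one column per series. A minimal stdlib-only sketch of wide-to-long normalization, with hypothetical column names (`unique_id`, `ds`, `y`) for illustration, not the evaluator's actual schema:

```python
# Long format: one row per (series, timestamp) observation.
# Wide format: one column per series, one row per timestamp.
# Column names here are illustrative, not the tool's documented schema.
import csv
import io

wide_csv = """date,store_a,store_b
2024-01-01,10,7
2024-01-02,12,9
"""

def wide_to_long(text, id_col="date"):
    """Melt a wide CSV into long (series, timestamp, value) rows."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        ts = record.pop(id_col)
        for series, value in record.items():
            rows.append({"unique_id": series, "ds": ts, "y": float(value)})
    return rows

long_rows = wide_to_long(wide_csv)
# Four observations: two dates x two series.
```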

STEP 02
Benchmark

Run expanding-window cross-validation with consistent folds, metrics, and timeouts across selected model families.
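The geometry of expanding-window cross-validation: each fold trains on all history up to a cutoff and tests on the next `horizon` points, so the train set grows with every fold. A sketch of the fold indexing, assuming contiguous non-overlapping test windows; not the evaluator's internal code:

```python
# Expanding-window CV: fold k trains on [0, test_start) and tests on
# [test_start, test_end). The train window grows; test windows tile the
# most recent n_folds * horizon observations.
def expanding_window_folds(n_obs, horizon, n_folds):
    """Yield (train_end, test_start, test_end) index triples, oldest first."""
    folds = []
    for k in range(n_folds, 0, -1):
        test_end = n_obs - (k - 1) * horizon
        test_start = test_end - horizon
        folds.append((test_start, test_start, test_end))
    return folds

# 30 observations, horizon 7, 3 folds:
folds = expanding_window_folds(30, 7, 3)
# [(9, 9, 16), (16, 16, 23), (23, 23, 30)]
```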

STEP 03
Rank

Aggregate accuracy, uncertainty, runtime, and stability into a structured leaderboard.
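The core of a leaderboard is averaging an error metric across folds per model and ranking ascending (lower error wins). A toy sketch with made-up model names and MASE scores, purely for illustration:

```python
# Aggregate per-fold scores into a ranked leaderboard.
# Models and numbers are hypothetical.
from statistics import mean

fold_scores = {  # model -> MASE per fold
    "seasonal_naive": [1.00, 1.05, 0.98],
    "lightgbm":       [0.82, 0.88, 0.85],
    "chronos":        [0.79, 0.91, 0.83],
}

leaderboard = sorted(
    ((model, round(mean(scores), 4)) for model, scores in fold_scores.items()),
    key=lambda row: row[1],  # ascending: lower error ranks first
)
```

A real leaderboard would fold in uncertainty, runtime, and stability alongside accuracy, as described above.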

STEP 04
Recommend

Use AutoML to narrow the candidate set based on data profile and intermittent-demand behavior.
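One standard heuristic for detecting intermittent-demand behavior is the Syntetos-Boylan classification, which splits series by ADI (average inter-demand interval) and CV² (squared coefficient of variation of nonzero demand sizes). A sketch of that heuristic, not necessarily the evaluator's exact rule:

```python
# Syntetos-Boylan demand classification (thresholds ADI 1.32, CV^2 0.49).
from statistics import mean, pstdev

def classify_demand(series):
    nonzero = [x for x in series if x != 0]
    if not nonzero:
        return "no_demand"
    adi = len(series) / len(nonzero)          # avg gap between demands
    cv2 = (pstdev(nonzero) / mean(nonzero)) ** 2  # size variability
    if adi < 1.32 and cv2 < 0.49:
        return "smooth"
    if adi >= 1.32 and cv2 < 0.49:
        return "intermittent"
    if adi < 1.32:
        return "erratic"
    return "lumpy"

# Mostly-zero series with stable sizes -> intermittent demand,
# favoring methods like Croston over dense-demand models.
demand_class = classify_demand([0, 0, 5, 0, 0, 5, 0, 5, 0, 0])
```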

STEP 05
Export

Write machine-readable outputs and visual reports for reviews, audits, and regression tracking.

// Outputs
Structured outputs.
Ready for teams and systems.

Each run produces files that work for both humans and downstream tooling.

📋
results.json

Machine-readable benchmark results with a stable schema for dashboards and pipelines.

🔍
details.json

Forecast traces and diagnostics for reproducibility and postmortem analysis.

📊
report.html

Interactive visual report with leaderboard, per-fold breakdown, and per-series views.

📄
report.pdf

Optional export for review workflows and stakeholder sharing.

// Developer Interfaces
4 ways to access.
One evaluation layer.

Use the interface that fits your workflow — from one-off local benchmarking to remote execution across multiple datasets.

⌨️
CLI

Run reproducible benchmarks from a single command, with model filters, config files, report generation, and diagnostics.

🐍
Python SDK

Embed evaluation workflows in notebooks, services, and internal tooling with a fluent Python interface.

🌐
REST API

Start a benchmark server and trigger remote runs through structured endpoints for benchmark, status, results, and health.

🔄
Campaign Mode

Evaluate a directory of datasets and generate portfolio-level summaries across many benchmark jobs.
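A campaign run is driven by a config file like the `campaigns/q1_eval.yaml` referenced in the CLI example. The key names below are illustrative assumptions, not the tool's documented schema; check the project README for the real format.

```yaml
# Hypothetical campaign config -- keys shown for illustration only.
campaign:
  name: q1_eval
  data_dir: datasets/q1/
  horizon: 7
  cv:
    strategy: expanding
    folds: 5
  models: auto
  reports: [html, pdf]
```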

Python-first evaluation
infrastructure.

Built on the Tollama ecosystem with production-grade Python tooling for reproducible, scalable model evaluation.

  • Python 3.10+ · Core runtime
  • PyPI · pip install tollama-eval
  • StatsForecast · Statistical model families
  • NeuralForecast · Deep learning model families
  • Tollama · TSFM runtime integration
  • FastAPI · REST API for evaluation jobs
  • Jinja2 + WeasyPrint · HTML & PDF reports
bash · tollama-eval CLI
# Install
pip install tollama-eval

# Run evaluation on a dataset
tollama-eval run \
  --data sales.csv \
  --horizon 7 \
  --models auto \
  --cv expanding \
  --folds 5

# Generate report
tollama-eval report \
  --run results/run_001 \
  --format html pdf

# Campaign mode (batch)
tollama-eval campaign \
  --config campaigns/q1_eval.yaml
// Tollama Integration
tollama is the runtime.
This is the evaluation layer.

Use tollama to pull and serve TSFMs through one API. Use Forecasting Evaluator Agent to compare those models against statistical, ML, and neural alternatives — then select the right model for production.

Unified TSFM Access

Pass a tollama server URL and model list to include foundation models in the same benchmark run.

⚖️
Fair Comparison

Evaluate TSFMs beside non-foundation baselines under the same folds, metrics, and report format.

🚀
Production Selection

Turn model comparison into a repeatable workflow with config files, API endpoints, and version-stable outputs.

// Roadmap
What comes next.

The next step is continuous evaluation — moving from one-off benchmarking to persistent model governance.

IN PROGRESS
Continuous Evaluation
Scheduled regression checks, benchmark baselines, and release gating for forecasting systems.
PLANNED
Evaluator Cloud
Hosted runs, team workspaces, and shareable leaderboard views.
PLANNED
Evaluation Narratives
Auto-generated benchmark summaries for engineers, analysts, and decision-makers.
// Get Started

Know which model
actually works.

Production-ready infrastructure for rigorous forecasting model evaluation.

View on GitHub ↗ ← Back to Tollama AI