Benchmark forecasting models. Select the right one for production.
Expanding-window CV, AutoML, leaderboard scoring, and reproducible reports.
Forecasting Evaluator Agent is the evaluation layer of the Tollama stack. It benchmarks statistical, ML, neural, and Tollama-powered foundation models with expanding-window cross-validation, structured scoring, and reproducible reporting.
Compare statistical baselines, intermittent-demand methods, ML models, neural architectures, and Tollama-served TSFMs under the same benchmarking protocol.
Forecasting teams repeat the same evaluation loop on every dataset — ingestion, fold design, model fitting, scoring, and reporting. The work is slow, manual, and difficult to standardize.
Fast baselines and established forecasting methods for robust comparison across standard workloads. ARIMA, ETS, Theta, CES, MSTL, and more.
Feature-based forecasters for tabular-style time-series problems and operational datasets. LightGBM, XGBoost, and other gradient-boosting frameworks.
Deep forecasting architectures for higher-capacity sequence modeling and multi-series learning. LSTM, N-BEATS, TFT, PatchTST, TimesNet.
Benchmark Chronos, TimesFM, Moirai, Granite TTM, and other Tollama-served TSFMs in the same loop as classical and neural baselines.
Run one command or call the API. The evaluator profiles the data, runs expanding-window cross-validation, scores each model, and exports reproducible artifacts.
Auto-detect long and wide CSV layouts, normalize the data, and profile frequency, series length, missingness, and intermittency.
Run expanding-window cross-validation with consistent folds, metrics, and timeouts across the selected model families (illustrated in the sketch after this list).
Aggregate accuracy, uncertainty, runtime, and stability into a structured leaderboard.
Use AutoML to narrow the candidate set based on data profile and intermittent-demand behavior.
Write machine-readable outputs and visual reports for reviews, audits, and regression tracking.
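To make the fold design concrete, here is a minimal expanding-window split written in plain pandas and NumPy. It is an illustration only: the column names, horizon, fold count, and naive baseline are assumptions for the example, not the evaluator's actual interface.

# Minimal expanding-window cross-validation sketch (illustrative only).
# Column names, horizon, and the naive baseline are assumptions for this example,
# not the evaluator's actual interface.
import numpy as np
import pandas as pd

def expanding_window_folds(n_obs, horizon, n_folds):
    # Yield (train_end, test_end) index pairs; each fold extends the training window.
    for k in range(n_folds, 0, -1):
        train_end = n_obs - k * horizon
        yield train_end, train_end + horizon

# Toy single-series dataset in long layout: (ds, y).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=120, freq="D"),
    "y": 50 + np.arange(120) * 0.2 + rng.normal(0, 3, 120),
})

horizon, n_folds = 14, 3
scores = []
for train_end, test_end in expanding_window_folds(len(df), horizon, n_folds):
    train, test = df.iloc[:train_end], df.iloc[train_end:test_end]
    forecast = np.repeat(train["y"].iloc[-1], len(test))                 # naive last-value baseline
    scores.append(np.mean(np.abs(test["y"].to_numpy() - forecast)))      # MAE per fold

print(f"MAE per fold: {[round(s, 2) for s in scores]}, mean: {np.mean(scores):.2f}")

Each fold trains on a longer prefix of the series and tests on the next horizon, which is the property the evaluator relies on to keep scores comparable across model families.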
Each run produces files that work for both humans and downstream tooling.
Machine-readable benchmark results with a stable schema for dashboards and pipelines (an example record is sketched after this list).
Forecast traces and diagnostics for reproducibility and postmortem analysis.
Interactive visual report with leaderboard, per-fold breakdown, and per-series views.
Optional export for review workflows and stakeholder sharing.
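To give a sense of what a machine-readable result with a stable schema can look like, here is a hypothetical leaderboard record expressed in Python. Every field name and value is an illustrative assumption, not the evaluator's published schema.

# Hypothetical leaderboard record (field names and values are illustrative assumptions,
# not the published schema).
import json

record = {
    "dataset": "retail_demand.csv",                          # assumed dataset identifier
    "model": "AutoETS",                                      # one candidate from the run
    "folds": 3,
    "metrics": {"mae": 41.2, "rmse": 55.7, "mase": 0.92},    # accuracy aggregated across folds
    "coverage_80": 0.78,                                     # uncertainty calibration, if intervals exist
    "fit_seconds": 12.4,                                     # runtime component of the leaderboard
}
print(json.dumps(record, indent=2))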
Use the interface that fits your workflow — from one-off local benchmarking to remote execution across multiple datasets.
Run reproducible benchmarks from a single command, with model filters, config files, report generation, and diagnostics.
Embed evaluation workflows in notebooks, services, and internal tooling with a fluent Python interface.
Start a benchmark server and trigger remote runs through structured endpoints for benchmark, status, results, and health (see the request sketch after this list).
Evaluate a directory of datasets and generate portfolio-level summaries across many benchmark jobs.
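A rough sketch of the remote workflow, using only the endpoint names mentioned above. The host, port, payload fields, and response fields are assumptions made for illustration, not the server's documented contract.

# Hypothetical client calls against the benchmark server. Only the endpoint names
# (benchmark, status, results, health) come from the description above; the host,
# port, and payload fields are assumptions for illustration.
import requests

base = "http://localhost:8000"                               # assumed server address

requests.get(f"{base}/health").raise_for_status()            # liveness check

job = requests.post(f"{base}/benchmark", json={
    "data_path": "data/retail_demand.csv",                   # assumed payload field
    "horizon": 14,
    "families": ["statistical", "ml", "neural"],
}).json()

status = requests.get(f"{base}/status", params={"job_id": job.get("job_id")}).json()
results = requests.get(f"{base}/results", params={"job_id": job.get("job_id")}).json()
print(status, results)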
Built on the Tollama ecosystem with production-grade Python tooling for reproducible, scalable model evaluation.
Use tollama to pull and serve TSFMs through one API. Use Forecasting Evaluator Agent to compare those models against statistical, ML, and neural alternatives — then select the right model for production.
Pass a tollama server URL and model list to include foundation models in the same benchmark run (sketched below).
Evaluate TSFMs beside non-foundation baselines under the same folds, metrics, and report format.
Turn model comparison into a repeatable workflow with config files, API endpoints, and version-stable outputs.
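As a sketch of how a run might include Tollama-served TSFMs alongside the other families, here is a hypothetical configuration expressed as a Python dict. Every key name and value is an assumption; only the idea of passing a tollama server URL plus a model list comes from the description above.

# Hypothetical benchmark configuration (all key names are illustrative assumptions).
# The grounded idea is passing a tollama server URL and a model list so that TSFMs
# are evaluated under the same folds, metrics, and report format as the other families.
config = {
    "data": "data/retail_demand.csv",
    "horizon": 14,
    "folds": 3,
    "families": ["statistical", "ml", "neural", "foundation"],
    "tollama": {
        "url": "http://localhost:11434",                     # assumed tollama server address
        "models": ["chronos", "timesfm", "moirai"],          # TSFMs named above
    },
}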
The next step is continuous evaluation — moving from one-off benchmarking to persistent model governance.
Production-ready infrastructure for rigorous forecasting model evaluation.