Deeper benchmark evidence for Tollama Core.
Cross-validated leaderboards, rich reports, and stable artifacts for model selection.
tollama-eval is the richer benchmark layer behind the Tollama Core workflow. Use Core for the thin preprocess -> forecast -> benchmark -> route loop. Use tollama-eval when you need broader model coverage, richer artifacts, and deeper evaluation analysis.
Compare statistical baselines, intermittent-demand methods, ML models, neural architectures, and Tollama-served TSFMs under the same benchmarking protocol.
Forecasting teams repeat the same evaluation loop on every dataset: ingestion, fold design, model fitting, scoring, and reporting. The work is slow, manual, and difficult to standardize.
Fast baselines and established forecasting methods for robust comparison across standard workloads. ARIMA, ETS, Theta, CES, MSTL, and more.
Feature-based forecasters for tabular-style time-series problems and operational datasets. LightGBM, XGBoost, and other gradient boosting frameworks.
Deep forecasting architectures for higher-capacity sequence modeling and multiseries learning. LSTM, N-BEATS, TFT, PatchTST, TimesNet.
Benchmark Chronos, TimesFM, Moirai, Granite TTM, and other Tollama-served TSFMs in the same loop as classical and neural baselines.
Run one command or call the API. The evaluator profiles the data, runs expanding-window cross-validation, scores each model, and exports reproducible artifacts.
Auto-detect long and wide CSV layouts, normalize the data, and profile frequency, series length, missingness, and intermittency.
Run expanding-window cross-validation with consistent folds, metrics, and timeouts across selected model families.
Aggregate accuracy, uncertainty, runtime, and stability into a structured leaderboard.
Use AutoML to narrow the candidate set based on data profile and intermittent-demand behavior.
Write machine-readable outputs and visual reports for reviews, audits, and regression tracking.
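The cross-validation and leaderboard steps above can be illustrated with a minimal sketch in plain Python. This is not the tollama-eval API: every name here (`expanding_window_folds`, `naive`, `seasonal_naive`, `leaderboard`) is hypothetical, the two toy forecasters stand in for real model families, and MAE is used only as an example metric. The point is the protocol: each model sees the same folds, the training window always starts at the first observation and grows by one horizon per fold, and per-fold scores are aggregated into a ranked table.

```python
from statistics import mean, pstdev


def expanding_window_folds(n_obs, horizon, n_folds):
    """Return (train_end, test_end) pairs; training always starts at index 0."""
    first_train_end = n_obs - horizon * n_folds
    if first_train_end < 1:
        raise ValueError("series too short for the requested folds")
    folds = []
    for k in range(n_folds):
        train_end = first_train_end + k * horizon
        folds.append((train_end, train_end + horizon))
    return folds


def naive(train, horizon):
    """Toy baseline: repeat the last observed value."""
    return [train[-1]] * horizon


def seasonal_naive(train, horizon, season=4):
    """Toy baseline: repeat the last full season (assumes len(train) >= season)."""
    return [train[-season + (h % season)] for h in range(horizon)]


def mae(actual, predicted):
    """Mean absolute error over one fold."""
    return mean([abs(a - p) for a, p in zip(actual, predicted)])


def leaderboard(series, models, horizon, n_folds):
    """Score every model on identical folds and sort by mean error."""
    folds = expanding_window_folds(len(series), horizon, n_folds)
    rows = []
    for name, forecast in models.items():
        fold_scores = []
        for train_end, test_end in folds:
            train = series[:train_end]
            test = series[train_end:test_end]
            fold_scores.append(mae(test, forecast(train, horizon)))
        rows.append({
            "model": name,
            "mae_mean": mean(fold_scores),    # accuracy
            "mae_std": pstdev(fold_scores),   # stability across folds
            "fold_scores": fold_scores,
        })
    return sorted(rows, key=lambda row: row["mae_mean"])


# Illustrative data: 16 observations with a period-4 seasonal pattern.
series = [10, 20, 30, 40] * 4
models = {"naive": naive, "seasonal_naive": seasonal_naive}
board = leaderboard(series, models, horizon=4, n_folds=2)

for row in board:
    print(row["model"], row["mae_mean"], row["mae_std"])
```

On this toy series the seasonal baseline ranks first because it reproduces the repeating pattern exactly. A real evaluator would also track runtime, uncertainty metrics, and timeouts, which this sketch omits.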
Each run produces files that work for both humans and downstream tooling.
Machine-readable benchmark results with a stable schema for dashboards and pipelines.
Forecast traces and diagnostics for reproducibility and postmortem analysis.
Interactive visual report with leaderboard, per-fold breakdown, and per-series views.
Optional export for review workflows and stakeholder sharing.
The thin Core bundle is enough for routing and local evidence loops. Reach for tollama-eval when you need larger model sweeps, richer reports, and campaign-style benchmarking.
Run reproducible benchmarks from a single command, with model filters, config files, report generation, and diagnostics.
Embed evaluation workflows in notebooks, services, and internal tooling with a fluent Python interface.
Start a benchmark server and trigger remote runs through structured endpoints for benchmark, status, results, and health.
Evaluate a directory of datasets and generate portfolio-level summaries across many benchmark jobs.
Built on the Tollama ecosystem with production-grade Python tooling for reproducible, scalable model evaluation.
Use Tollama Core to preprocess, forecast, benchmark, and route with the thin artifact set. Use tollama-eval to compare those models against statistical, ML, and neural alternatives with richer outputs like results.json, details.json, and report.html.
Pass a tollama server URL and model list to include foundation models in the same benchmark run.
Evaluate TSFMs beside non-foundation baselines under the same folds, metrics, and report format.
Turn model comparison into a repeatable workflow with config files, API endpoints, and version-stable outputs.
The next step is continuous evaluation: moving from one-off benchmarking to persistent benchmark evidence and release gating.
Production-ready infrastructure for rigorous forecasting model evaluation.