tilelang.profiler.bench¶
Profiler and benchmarking utilities for PyTorch functions.
Attributes¶
Classes¶
- suppress_stdout_stderr: Context manager to suppress stdout and stderr output.
Functions¶
- do_bench: Benchmark the runtime of a PyTorch function with L2 cache management.
Module Contents¶
- class tilelang.profiler.bench.suppress_stdout_stderr¶
Context manager to suppress stdout and stderr output.
Source: https://github.com/deepseek-ai/DeepGEMM/blob/main/deep_gemm/testing/bench.py
- __enter__()¶
- __exit__(*_)¶
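The upstream implementation (see the DeepGEMM link above) silences output at the OS file-descriptor level so that prints from C/CUDA code are suppressed as well. The sketch below is a simplified, Python-level stand-in that shows the same interface and restore-on-exit behavior; it is illustrative, not tilelang's actual code.

```python
import os
import sys

class suppress_stdout_stderr:
    """Silence stdout and stderr inside a with-block (simplified sketch).

    Note: redirecting sys.stdout/sys.stderr only silences Python-level
    output; the real class redirects the underlying file descriptors.
    """

    def __enter__(self):
        self._devnull = open(os.devnull, "w")
        self._old_out, self._old_err = sys.stdout, sys.stderr
        sys.stdout, sys.stderr = self._devnull, self._devnull
        return self

    def __exit__(self, *_):
        # Restore the original streams, then release the devnull handle.
        sys.stdout, sys.stderr = self._old_out, self._old_err
        self._devnull.close()
```

Typical use is wrapping a noisy warmup or compilation step, e.g. `with suppress_stdout_stderr(): fn()`.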
- tilelang.profiler.bench.IS_CUDA¶
- tilelang.profiler.bench.device = 'cuda:0'¶
- tilelang.profiler.bench.Event¶
- tilelang.profiler.bench.do_bench(fn, warmup=25, rep=100, _n_warmup=0, _n_repeat=0, quantiles=None, fast_flush=True, backend='event', return_mode='mean')¶
Benchmark the runtime of a PyTorch function with L2 cache management.
This function provides accurate GPU kernel timing by:
- Clearing L2 cache between runs for consistent measurements
- Auto-calculating warmup and repeat counts based on kernel runtime
- Supporting multiple profiling backends (CUDA events or CUPTI)
- Offering flexible result aggregation (mean/median/min/max/quantiles)
- Parameters:
fn (Callable) – Function to benchmark
warmup (float) – Target warmup time in milliseconds (default: 25)
rep (float) – Target total benchmark time in milliseconds (default: 100)
_n_warmup (int) – Manual override for warmup iterations (default: 0 = auto)
_n_repeat (int) – Manual override for benchmark iterations (default: 0 = auto)
quantiles (list[float] | None) – Performance percentiles to compute (e.g., [0.5, 0.95])
fast_flush (bool) – Flush the L2 cache with an int32 buffer instead of int8, which is faster (default: True)
backend (Literal['event', 'cupti']) – Profiler backend - “event” (CUDA events) or “cupti” (default: “event”)
return_mode (Literal['min', 'max', 'mean', 'median']) – Result aggregation method - “mean”, “median”, “min”, or “max”
- Returns:
Runtime in milliseconds (float) or list of quantile values if quantiles specified
- Return type:
float | list[float]
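The `warmup` and `rep` parameters are time budgets, not iteration counts: the function first estimates the kernel's runtime, then sizes the warmup and measurement loops to fill those budgets (unless `_n_warmup` / `_n_repeat` override them). A minimal sketch of that calibration logic, assuming a single timed estimate in milliseconds; the name `calibrate` and the exact rounding are illustrative, not tilelang's internals:

```python
def calibrate(estimate_ms: float, warmup: float = 25, rep: float = 100,
              _n_warmup: int = 0, _n_repeat: int = 0) -> tuple[int, int]:
    """Turn time budgets (ms) into iteration counts, given one runtime estimate."""
    # Manual overrides win; otherwise fit as many iterations as the budget allows,
    # with a floor of 1 so very slow kernels are still measured at least once.
    n_warmup = _n_warmup if _n_warmup > 0 else max(1, int(warmup / estimate_ms))
    n_repeat = _n_repeat if _n_repeat > 0 else max(1, int(rep / estimate_ms))
    return n_warmup, n_repeat
```

A typical call site then looks like `do_bench(lambda: kernel(a, b), warmup=25, rep=100)`, with a fast kernel yielding many iterations and a slow one only a few.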