Case Study · LensHH-LT

One kernel, laptop to H100 — merit-function throughput across hardware

LensHH-LT evaluates the same merit function — bit for bit — on a fanless laptop, a MacBook Air, gaming desktops, and rented A100 / H100 GPUs. The GPU value pre-screen treats every candidate lens as one thread, reaching hundreds of thousands of merit evaluations per second; the CPU runs the analytic-Jacobian polish. Across eight machines and three merit functions, the GPU result matches the CPU to the floating-point noise floor.

8 Systems benchmarked (laptop → H100)
~446k/s Peak merit evaluations (H100, GPU pre-screen)
≤ 6.6×10⁻¹⁴ CPU↔GPU agreement — bit-identical

The benchmark

MeritEvalBench times the primitives the optimizer is built from, with no LM loop in the way. It measures whole-merit value throughput — how many candidate designs per second the engine can score — on the CPU (all cores) and on the GPU, and it separately times the analytic Jacobian (residuals plus the full derivative matrix in one pass) on the CPU. The GPU and CPU paths call the same compiled evaluate_merit, so the GPU result is bit-identical to the CPU by construction. The GPU is used as a value-only pre-screen — a wide Multistart sieve — while the Levenberg–Marquardt polish, with analytic derivatives, runs on the CPU.

Eight systems

The identical Linux / Windows / macOS binary, exercised from a low-power laptop up to a datacenter H100. GPU support ships as a prebuilt native cubin (sm_80 / 89 / 90 / 120), so there is no CUDA toolkit to install on the target and no multi-minute first-run compile — the RTX 5070 Ti below ran with no SDK installed at all.

SystemCPUGPUGPU architecture (SMs)Memory bandwidth
LaptopCore i7-1255U (12T)— (Iris Xe)
MacBook AirApple Silicon (ARM64)— (integrated)
Mini-desktopRyzen 3 5300G (8T)— (Radeon)
DesktopCore i5-13400F (16T)RTX 4060Ada sm_89 (24)272 GB/s
WorkstationRyzen 7 9800X3D (16T)RTX 5070 TiBlackwell sm_120 (70)~896 GB/s
Cloud16 vCPUA100 40GBAmpere sm_80 (108)~1.55 TB/s
Cloud24 vCPUH100 80GBHopper sm_90 (132)~3.35 TB/s
Cloud28 vCPURTX PRO 6000Blackwell sm_120 (188)~1.79 TB/s

Three merit functions

The three test designs span a wide range of merit-function size — and the GPU advantage grows with that size, because the device hides memory latency behind more parallel work.

All three: three wavelengths, three fields, infinite conjugate. The imaging lens is Example 2 of U.S. Patent US 12,321,041 B2 (Tanabe et al.).

Merit evaluations per second

Bar charts of merit evaluations per second on CPU (all cores) versus GPU pre-screen, across eight systems and three lenses, log scale
Merit evaluations / second, CPU (all cores) vs GPU value pre-screen, per lens (log scale). GPU output is bit-identical to the CPU on every card.

On the GPU, every candidate lens is a thread: the device scores tens of thousands of designs in one launch. An H100 reaches roughly 446k merit evaluations per second on the Cooke triplet and 65k/second on the heaviest (WAVEX + SENS) merit; even a consumer RTX 5070 Ti clears 142k and 15k. The printed CPU multiples depend on how many cores the host has, so the figures above lead with absolute throughput — the number that actually scales with the hardware.

Bandwidth-bound, not FP64-bound

The whole-merit kernel is bound by memory bandwidth, not double-precision compute. The clearest evidence: the consumer-class Blackwell card matches the A100 within about 1% on the heaviest merit (37k vs 37k evaluations/second) despite having roughly 14× less FP64 throughput. Across all four GPUs the ranking tracks memory bandwidth — from the 4060's 272 GB/s up to the H100's HBM3 — not the FP64 spec. In practice that means a well-chosen consumer GPU delivers datacenter-class merit throughput for this workload.

Analytic derivatives on the CPU

Bar chart of C++ analytic-Jacobian evaluations per second per CPU, grouped by lens, log scale
C++ merit + analytic derivatives: one residuals + full-Jacobian evaluation per second, per CPU (log scale).

The Levenberg–Marquardt polish needs the Jacobian, and LensHH-LT computes it analytically — residuals and the full derivative matrix in a single forward-mode dual-number pass, far faster than finite differences. This is the CPU's job in the hybrid pipeline: the GPU sieves a wide field of candidates by value, then the CPU refines the survivors with exact derivatives. The chart shows that single-eval cost scales with the host CPU across every machine in the sweep.

What's worth taking away

Try it on your hardware

LensHH-LT runs on Windows, Linux, and macOS, and the GPU pre-screen engages automatically when a CUDA device is present. Download the free trial and run a Multistart on your own designs.

Download free trial Back to LensHH-LT