Forge turns PyTorch models into optimized CUDA and Triton kernels automatically. 32 AI agents run in parallel, each trying different optimization strategies like tensor cores, memory coalescing, and kernel fusion. A judge validates every kernel for correctness before benchmarking. We got 5x faster inference than torch.compile on Llama 3.1 8B and 4x on Qwen 2.5 7B. Works on any PyTorch model. Free trial on one kernel. Full credit refund if we don't beat torch.compile.
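To make "kernel fusion" concrete, here is a minimal illustrative sketch (not Forge's output, and in NumPy rather than CUDA/Triton): an unfused sequence of elementwise ops materializes an intermediate tensor, while a fused version reuses one buffer, which is the memory-traffic saving a single fused GPU kernel delivers.

```python
import numpy as np

def unfused(x, w):
    # Two separate "kernels": the intermediate `tmp` is allocated and
    # written out in full before the second op reads it back.
    tmp = x * w
    return np.maximum(tmp, 0.0)

def fused(x, w):
    # Fused version: no intermediate buffer is allocated. A real fused
    # CUDA/Triton kernel goes further and does both ops in a single
    # pass over memory, which is where most of the speedup comes from.
    out = np.empty_like(x)
    np.multiply(x, w, out=out)     # write the product directly into out
    np.maximum(out, 0.0, out=out)  # apply ReLU in place
    return out
```

Both functions return identical results; only the memory behavior differs.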
Hey PH!
If we don't beat torch.compile, you get your credits back!
Real results on B200:
Llama 3.1 8B: 5x faster than torch.compile
Qwen 2.5 7B: 4x faster
SDXL UNet: 3x faster
Turning “PyTorch in, tuned CUDA/Triton out” into something productized like this is a very ambitious swing, especially with 32 agents coordinating on the same kernel. The hardest part of these systems in my experience is not just finding a faster variant once, but keeping the optimized kernels robust across driver changes, new GPUs and slightly different input shapes without a constant babysitting loop.
How are you handling that stability vs. raw speed tradeoff in the UX: do you bias toward conservative, portable kernels by default, or lean into aggressive, hardware-specific wins and let power users manage the risk?
32 parallel coder+judge pairs is a smart setup. The judge comparison logic is the interesting part... wondering if it just checks against torch.compile baseline or if you can define custom metrics like memory footprint or specific tensor core utilization targets.
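For what a judge with user-defined metrics could look like, here is a purely hypothetical sketch (the `judge` interface and `latency_ms` metric are invented for illustration; Forge's actual judge is not public): correctness is a hard gate, and only kernels that pass it get scored on pluggable metrics.

```python
import math
import time

def judge(candidate, reference, sample, metrics, rel_tol=1e-5, abs_tol=1e-6):
    """Gate on correctness first; only then score configurable metrics.
    `metrics` maps a name to a callable(fn, sample) -> float.
    Hypothetical sketch, not Forge's real interface."""
    got, want = candidate(sample), reference(sample)
    correct = len(got) == len(want) and all(
        math.isclose(g, w, rel_tol=rel_tol, abs_tol=abs_tol)
        for g, w in zip(got, want)
    )
    if not correct:
        return {"correct": False, "scores": {}}
    return {"correct": True,
            "scores": {name: fn(candidate, sample) for name, fn in metrics.items()}}

def latency_ms(fn, sample, iters=100):
    # Simple wall-clock average; a GPU judge would use device-side timers.
    start = time.perf_counter()
    for _ in range(iters):
        fn(sample)
    return (time.perf_counter() - start) * 1000 / iters
```

Under this shape, a memory-footprint or tensor-core-utilization target would just be another entry in the `metrics` dict.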
Correctness is the main risk with generated kernels. What is your validation strategy beyond “matches reference outputs”—e.g., tolerances, randomized testing across shapes/dtypes, determinism, and how you debug/report failures so users can trust and iterate quickly?
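The kind of randomized shape/dtype testing the question describes can be sketched in a few lines (illustrative only, assuming nothing about Forge's actual harness; NumPy stands in for the GPU kernel under test): fuzz the candidate against the reference across random shapes and dtypes, with looser tolerances at lower precision.

```python
import numpy as np

def randomized_validate(candidate, reference, n_trials=50, seed=0):
    """Fuzz a candidate kernel against a reference across random shapes
    and dtypes, with dtype-appropriate tolerances. Returns the list of
    failing (shape, dtype) cases so failures are reportable."""
    rng = np.random.default_rng(seed)
    # Looser tolerances for lower precision.
    cases = [(np.float16, 1e-2, 1e-2),
             (np.float32, 1e-4, 1e-6),
             (np.float64, 1e-7, 1e-9)]
    failures = []
    for _ in range(n_trials):
        dtype, rtol, atol = cases[rng.integers(len(cases))]
        ndim = int(rng.integers(1, 4))                       # 1 to 3 dims
        shape = tuple(int(rng.integers(1, 64)) for _ in range(ndim))
        x = rng.standard_normal(shape).astype(dtype)
        if not np.allclose(candidate(x), reference(x), rtol=rtol, atol=atol):
            failures.append((shape, np.dtype(dtype).name))
    return failures
```

Returning the failing (shape, dtype) pairs, rather than a bare pass/fail, is what lets users debug and iterate quickly.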
About Forge Agent on Product Hunt
“Swarm Agents That Turn Slow PyTorch Into Fast GPU Kernels”
Forge Agent launched on Product Hunt on January 23rd, 2026 and earned 108 upvotes and 6 comments, placing #9 on the daily leaderboard.
Forge Agent was featured in Hardware (11.4k followers), Developer Tools (511k followers) and Artificial Intelligence (466.2k followers) on Product Hunt. Together, these topics include over 155.5k products, making this a competitive space to launch in.
Who hunted Forge Agent?
Forge Agent was hunted by Jaber Jaber. A “hunter” on Product Hunt is the community member who submits a product to the platform — uploading the images, the link, and tagging the makers behind it. Hunters typically write the first comment explaining why a product is worth attention, and their followers are notified the moment they post. Around 79% of featured launches on Product Hunt are self-hunted by their makers, but a well-known hunter still acts as a signal of quality to the rest of the community.