Thinking Machines Lab Unveils Research Into AI Model Consistency

Mira Murati, founder of Thinking Machines Lab and former OpenAI CTO

Thinking Machines Lab, founded by former OpenAI CTO Mira Murati, released its inaugural research blog post on September 10, the startup’s first public research output since it closed a record-breaking $2 billion seed funding round earlier this year. The post, authored by researcher Horace He and published on the company’s Connectionism blog, tackles a critical issue in large language model (LLM) inference: nondeterministic outputs from identical prompts (Connectionism Blog).

The research addresses how inconsistency in AI responses undermines enterprise reliability and reproducibility. By focusing on the underlying computational processes rather than surface‐level randomness, Thinking Machines Lab positions itself at the forefront of efforts to improve AI model trustworthiness and efficiency.

Identifying the Root of AI Inconsistency

He’s study, “Defeating Nondeterminism in LLM Inference,” challenges the prevailing assumption that concurrent GPU threading alone causes response variability. Instead, he demonstrates that the ordering of floating-point operations, which shifts as the inference server’s batch size changes under load, produces divergent yet mathematically valid outputs. Using Qwen’s 235-billion-parameter model, He generated 1,000 completions from the same prompt with temperature set to zero and found 80 unique responses, with the first divergence appearing at the 103rd token.
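
The flavor of that experiment is straightforward to reproduce against any OpenAI-compatible inference endpoint serving real traffic. In the sketch below, the endpoint URL, model identifier, prompt, and request count are illustrative placeholders rather than the blog post’s exact setup; it simply fires the same greedy-decoding request repeatedly and counts how many distinct answers come back.

```python
from collections import Counter

from openai import OpenAI  # assumes the `openai` Python client is installed

# Placeholder endpoint and model identifier: any OpenAI-compatible server
# (for example, a vLLM deployment) hosting the model under test.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

PROMPT = "Explain why the sky is blue."  # illustrative prompt, not the post's
N_RUNS = 100  # the blog post used 1,000 completions

completions = []
for _ in range(N_RUNS):
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B",  # placeholder name for the 235B-parameter model
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.0,  # greedy decoding: naively, this "should" be deterministic
        max_tokens=200,
    )
    completions.append(resp.choices[0].message.content)

print(f"{len(Counter(completions))} distinct completions from {N_RUNS} identical requests")
```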

The analysis reveals that GPU kernel execution, specifically how reduction operations are scheduled, varies with how many requests are batched together at a given moment. This orchestration difference, not sampling randomness, is the primary driver of nondeterminism: because floating-point addition is not associative, kernels that group the same arithmetic differently return slightly different results, and those tiny differences compound token by token.
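
The underlying numerical fact is that regrouping a floating-point sum changes its low-order bits. The NumPy snippet below (an illustration, not code from the post) adds the same 10,000 float32 values strictly left to right and then in a chunked, tree-like order, the way a GPU kernel might schedule the same reduction, and prints the mismatch.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)

# Strict left-to-right accumulation.
sequential = np.float32(0.0)
for v in x:
    sequential += v

# Chunked, tree-like reduction: sum 100-element blocks first, then combine.
chunked = x.reshape(100, 100).sum(axis=1).sum()

print(sequential, chunked)    # two slightly different float32 values
print(sequential == chunked)  # typically False
```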

Proposed Solutions for Deterministic Inference

To achieve consistent outputs, He proposes making transformer kernels “batch-invariant” by enforcing a fixed reduction order regardless of batch size. The key operations targeted are RMSNorm, matrix multiplication, and attention. The team has released demonstration code built on the open-source vLLM framework, showing deterministic inference at a performance overhead of roughly 60 percent in its current form, which has not yet been optimized for speed.
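
A toy sketch of the batch-invariance idea, in NumPy rather than the team’s actual GPU kernels (the function name and chunk size are made up for illustration): each row is reduced in fixed-size chunks whose layout depends only on the row length, so the result is bit-identical whether the row is processed alone or inside a large batch. Real GPU kernels, by contrast, often switch reduction strategies (for example, split-K matrix multiplication) as the batch size changes, which is exactly what a batch-invariant kernel forbids.

```python
import numpy as np

def batch_invariant_rowsum(x: np.ndarray, chunk: int = 256) -> np.ndarray:
    """Sum the last axis of x in fixed-size chunks, left to right.

    The reduction order depends only on the row length and the chunk size,
    never on how many rows are in the batch, so each row yields the same
    bits whether it arrives alone or packed with many other requests.
    """
    acc = np.zeros(x.shape[:-1], dtype=x.dtype)
    for start in range(0, x.shape[-1], chunk):
        acc = acc + x[..., start:start + chunk].sum(axis=-1, dtype=x.dtype)
    return acc

row = np.random.default_rng(1).standard_normal((1, 4096)).astype(np.float32)
batch = np.tile(row, (64, 1))  # the same row, now inside a batch of 64

# The row's sum comes out bit-identical regardless of batch size.
print(np.array_equal(batch_invariant_rowsum(row)[0],
                     batch_invariant_rowsum(batch)[0]))  # expected: True
```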

Beyond enterprise use cases demanding reproducibility, the approach promises to streamline reinforcement learning workflows by aligning numerical behavior between training and inference, potentially accelerating convergence and reducing resource waste.

Silicon Valley’s Newest AI Research Lab

Thinking Machines Lab emerged from stealth in July with a $2 billion seed round led by Andreessen Horowitz, valuing the company at $12 billion and drawing strategic participation from NVIDIA, AMD, and Cisco (Reuters). The team includes former OpenAI luminaries such as John Schulman and Barrett Zoph, emphasizing Murati’s vision of open scientific collaboration in contrast to the more closed posture of larger AI firms.

With plans to release a multimodal AI product featuring an open‐source component, the lab commits to regular publication of technical findings, code, and papers—reinforcing its motto, “science is better when shared.”
