Meta's Superintelligence Ambition: A Deep Dive into Leadership, Compute, Talent, and the Llama 4 Lessons
Meta's recent acquisition of a 49% stake in Scale AI, a deal that valued the data-labeling company at roughly $29 billion, sent shockwaves through the tech industry. It underscored a critical truth: for a company generating roughly $90 billion in annual operating cash flow from its advertising empire, financial constraints are secondary to strategic imperatives. Despite this seemingly unlimited war chest, Meta has found itself lagging behind pure-play AI foundation labs in the race for cutting-edge model performance.
The true catalyst for Meta's intensified focus was the moment it ceded its leadership position in open-weight models to competitors like DeepSeek. This loss of ground served as a wake-up call, prompting a fundamental shift in strategy. Now, firmly in 'Founder Mode,' Mark Zuckerberg has personally taken the helm, identifying two primary bottlenecks hindering Meta's progress toward superintelligence: Talent and Compute. As one of the few remaining founders still steering a tech behemoth, Zuckerberg is not one to shy away from bold moves, even if it means diverting capital from traditional avenues like stock buybacks to fuel future innovation.

Beyond simply allocating capital, Zuckerberg is orchestrating a fundamental overhaul of Meta's approach to generative AI. This includes establishing a new 'Superintelligence' team from the ground up and aggressively recruiting top AI talent with compensation packages that dwarf typical tech salaries. Reports indicate offers for key individuals on this team can reach $200 million over four years – a staggering 100 times the compensation of their peers. While not every offer, including some reportedly in the billion-dollar range for leadership at competitors like OpenAI, has been accepted, this strategy is undeniably disrupting the talent market and significantly increasing the cost of retaining top AI researchers for rivals.
Perhaps even more indicative of this radical shift is Zuckerberg's decision to discard Meta's long-standing datacenter construction playbook. Inspired by the rapid deployment strategies seen elsewhere in the industry, Meta is now building multi-billion-dollar GPU clusters in unconventional structures dubbed 'Tents.' This move signals a prioritization of speed and deployment velocity over traditional datacenter aesthetics and redundancy.

This report unpacks Meta's unprecedented reinvention across Compute, Talent, and Data in its aggressive pursuit of Superintelligence. We trace the journey from the open-source dominance of Llama 3 to the challenges faced by the Llama 4 'Behemoth' model. While the company may have stumbled, it is far from out. In fact, Meta's projected growth in training FLOPs is poised to rival that of leading labs like OpenAI, rapidly taking the company from GPU-constrained to compute-rich on a per-researcher basis.

Meta GenAI 1.0: AI Incrementalism
For years, major tech companies like Meta and Google pursued an 'AI Incrementalism' strategy. Rather than focusing on developing groundbreaking, general-purpose AI models from scratch, their primary focus was on leveraging AI to enhance existing products and services. For Meta, this meant deploying sophisticated AI and machine learning models to improve core functionalities such as recommendation algorithms for feeds (Facebook, Instagram, Threads), optimizing ad targeting for increased revenue, automating content moderation and tagging, and building internal tools to boost employee productivity.
This approach yielded significant financial dividends. Meta successfully navigated challenges like Apple's App Tracking Transparency (ATT) feature, which aimed to limit user tracking. By improving its on-platform AI capabilities, Meta could maintain and even enhance the effectiveness of its advertising systems despite reduced access to off-platform user data. The financial results demonstrated the success of this strategy, contributing substantially to Meta's robust cash flow.

However, this incremental approach also meant that Meta's large language model (LLM) efforts, while significant, didn't always prioritize frontier research aimed at achieving artificial general intelligence (AGI) or superintelligence in the same way that pure-play labs like OpenAI or Anthropic did. Capital allocation was primarily directed towards supporting and optimizing the core business, as highlighted in Meta's own statements:
“Our CapEx growth this year is going toward both generative AI and core business needs with the majority of overall CapEx supporting the core.”
Source: Meta Q1 2025 earnings call
This strategic choice meant that while Meta built impressive internal AI capabilities, it didn't possess the same existential drive to dominate entirely new AI-native use cases, such as conversational AI chatbots (OpenAI's ChatGPT) or advanced coding assistants (Anthropic's focus). This difference in focus is starkly visible when comparing the allocation of resources, particularly human capital, between Meta and a leading AI foundation lab like OpenAI. The intense competition for top AI researchers, exacerbated by Meta's aggressive recruitment tactics and inflated salaries, underscores the shift in priorities.

Consequently, when evaluating the traction of consumer-facing generative AI applications, Meta's offerings have historically trailed behind the reach and engagement of ChatGPT. While Meta AI is integrated into its vast ecosystem, it hasn't yet captured the public imagination or achieved the standalone prominence of OpenAI's flagship product.

However, this landscape is rapidly evolving. Leveraging proprietary industry models, we forecast a significant escalation in Meta's generative AI investment in the coming years. This isn't just a gradual increase; it's a strategic pivot aimed at closing the gap and competing directly at the frontier of AI capabilities.

Meta GenAI 2.0 – Part 1, Re-Inventing the Datacenter Strategy (Again)
The pursuit of superintelligence demands an unprecedented scale of computational power. Recognizing this, Meta has embarked on a radical transformation of its datacenter strategy. Just a year prior, the company had already moved away from its traditional, decade-old 'H'-shaped datacenter blueprint, opting for a new design specifically optimized for AI workloads. This shift aimed to improve power density, cooling efficiency, and network topology to better support massive GPU clusters.

Now, in 2025, Zuckerberg has decided to reinvent the strategy yet again. This latest pivot is heavily influenced by the rapid time-to-market demonstrated by competitors like xAI. Meta is adopting a datacenter design philosophy that prioritizes speed of deployment above all other considerations. The company is not just planning these new sites; it is actively building them at an astonishing pace. This move is likely to surprise traditional datacenter and real estate investors who were already adjusting to the speed and scale of projects like xAI's Memphis site.
From Buildings to Tents
The most visible manifestation of this new strategy is the construction of multi-billion-dollar GPU clusters within structures referred to internally as 'Tents.' This design is not focused on architectural elegance or maximizing traditional redundancy metrics. Instead, its singular purpose is to bring massive amounts of compute power online as quickly as possible.

These structures utilize prefabricated power and cooling modules and ultra-light building materials to drastically reduce construction time. The emphasis on speed is so pronounced that these sites reportedly lack traditional backup generation, such as diesel generators. This means they are entirely reliant on the grid connection, a significant departure from standard datacenter practices that prioritize uninterrupted power supply.
Power for these 'Tent' datacenters is currently drawn from Meta's on-site substations. This requires sophisticated workload management to make maximum use of available grid power, potentially curtailing or shutting down less critical workloads during periods of peak demand or grid stress, such as hot summer days when air-conditioning loads are high.
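As a rough illustration, the sketch below shows what priority-based curtailment logic can look like; the workload names, power figures, and greedy policy are hypothetical assumptions and do not describe Meta's actual system.

```python
# Hypothetical sketch of priority-based workload curtailment when grid headroom
# shrinks (e.g., on a hot afternoon). Workload names, power figures, and the
# greedy policy are illustrative assumptions, not Meta's actual system.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    power_mw: float
    priority: int  # lower = more critical (0 = frontier training run)

def curtail(workloads: list[Workload], available_mw: float) -> list[Workload]:
    """Keep the most critical workloads that fit within the available grid power."""
    kept, used_mw = [], 0.0
    for w in sorted(workloads, key=lambda w: w.priority):
        if used_mw + w.power_mw <= available_mw:
            kept.append(w)
            used_mw += w.power_mw
    return kept

site = [
    Workload("frontier_pretraining", 600.0, priority=0),
    Workload("rl_post_training", 250.0, priority=1),
    Workload("batch_eval_inference", 120.0, priority=2),
    Workload("ad_hoc_research_jobs", 80.0, priority=3),
]

# A heat wave cuts usable grid power from ~1,050 MW to 900 MW:
print([w.name for w in curtail(site, available_mw=900.0)])
# ['frontier_pretraining', 'rl_post_training']
```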

The Prometheus 1GW AI Training Cluster – An “All Of The Above” Infrastructure Strategy
In parallel with the 'Tent' deployments, Meta is also constructing one of the world's largest dedicated AI training clusters in Ohio. Internally codenamed 'Prometheus,' this cluster represents an 'all of the above' approach to infrastructure, combining various strategies to maximize compute capacity and minimize deployment time.
The Prometheus cluster integrates:
- Self-build campuses: Traditional large-scale datacenter construction projects initiated and managed directly by Meta.
- Leasing from third parties: Securing capacity from wholesale datacenter providers to accelerate deployment without the lead time of ground-up construction.
- AI-optimized designs: Incorporating lessons learned from previous AI-specific datacenter blueprints to maximize efficiency for GPU workloads.
- Multi-datacenter-campus training: Connecting geographically distributed sites with high-bandwidth networks to function as a single, massive training cluster.
- On-site, behind-the-meter natural gas generation: Deploying dedicated power plants to supplement grid supply and ensure power availability for critical AI workloads.
Sources indicate that Meta is connecting these disparate sites with ultra-high-bandwidth networks, forming a unified backend powered by advanced networking equipment like Arista 7808 Switches utilizing Broadcom Jericho and Ramon ASICs. This sophisticated network fabric is essential for enabling efficient distributed training across multiple physical locations.

The combination of self-build and leased capacity allows Meta to accelerate its compute ramp significantly. In the latter half of 2024, Meta reportedly pre-leased more datacenter capacity than any other hyperscaler, with a substantial portion located in Ohio, specifically for the Prometheus project.

Furthermore, to overcome limitations in the local power grid infrastructure, Meta, in collaboration with partners like Williams, is constructing two 200MW on-site natural gas power plants. The equipment list for the first plant includes a mix of turbines and reciprocating engines, highlighting a diverse approach to power generation tailored for datacenter needs. This move into behind-the-meter generation is a significant development with potential implications for traditional power suppliers and grid operators, as well as companies specializing in on-site power solutions.

The ability to leverage large, distributed datacenters asynchronously is particularly important with the increasing adoption of techniques like reinforcement learning in AI training. This allows models to be continuously improved using compute resources spread across multiple locations, contributing to overall model intelligence through post-training processes.
While OpenAI has demonstrated a significant compute advantage, Meta's second frontier cluster, Hyperion, is designed specifically to close that gap and potentially surpass rivals in sheer scale.

Beating Stargate at Scale: Meta’s Hyperion 2GW Cluster
While much public attention has been focused on high-profile projects like the Stargate datacenter in Abilene, Texas, Meta has been quietly planning and executing its own massive response for over a year. The result is a colossal cluster under construction in Louisiana, internally referred to as 'Hyperion.' By the end of 2027, Hyperion is projected to become the world's largest individual datacenter campus, with over 1.5GW of IT power capacity in its initial phase alone, ultimately targeting 2GW.

Meta broke ground on the Hyperion site at the end of 2024 and has been rapidly progressing on both the necessary power infrastructure and the datacenter campus itself. This aggressive timeline and massive scale underscore Meta's determination to possess the computational foundation required for frontier AI research and deployment.

It's important to note that Prometheus and Hyperion are just two examples of Meta's extensive datacenter buildout. The company has numerous other sites under construction and ramping up capacity globally. A comprehensive view of Meta's AI datacenters, including expected completion dates and power capacities, is available in detailed industry models.
Llama 4 Failure – From Open-Source Prince to Behemoth Pauper
Before delving further into Meta's talent strategy for Superintelligence, it's crucial to understand the context that precipitated this aggressive pivot. Meta had established itself as a leader in the open-source AI community with the success of its Llama 3 model family. However, the follow-up stumbled: Llama 4's flagship model, 'Behemoth,' reportedly fell short of expectations, and Meta lost the open-source frontier lead to models from labs like China's DeepSeek.

Based on technical analysis and industry insights, several key factors likely contributed to the challenges faced during the Llama 4 training run:
- Chunked attention: An architectural choice that may have hindered long-range reasoning.
- Expert choice routing: A Mixture-of-Experts routing strategy that presented inference challenges.
- Pretraining data quality: Issues with the scale and cleanliness of the training data.
- Scaling strategy and coordination: Difficulties in effectively scaling research experiments and managing the large training effort.
Chunked Attention
Attention mechanisms are fundamental to transformer models like LLMs, allowing them to weigh the importance of different tokens in the input sequence. However, the computational cost of standard causal attention scales quadratically with the number of tokens. This becomes a significant bottleneck for processing very long contexts.

To address this, researchers have developed more memory-efficient attention mechanisms. Meta's choice for Behemoth was reportedly chunked attention. This method divides the input sequence into fixed-size blocks, processing attention within these blocks. While this reduces the memory footprint and allows for processing longer overall contexts, it introduces a potential drawback.

In chunked attention, tokens near the start of each block see little or none of the preceding context: within a given layer, no token can attend outside its own block, and the first token of a block attends only to itself. Although global attention layers can help propagate some information across blocks, this discontinuity can create 'blind spots,' particularly at the boundaries between chunks. This architectural limitation can significantly impact the model's ability to perform complex reasoning tasks that require integrating information across long sequences, especially when a chain of thought spans multiple chunks.

In contrast, sliding window attention, used in other successful models, maintains local continuity by sliding the attention window token by token. While still requiring multiple layers to propagate context over very long distances, it avoids the abrupt loss of information at fixed boundaries.
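To make the difference concrete, the toy sketch below builds the two boolean attention masks for an 8-token sequence; the chunk and window sizes are arbitrary illustrations, not the actual Llama 4 configuration.

```python
# Toy comparison of the causal attention masks produced by chunked attention vs.
# sliding-window attention. Chunk/window sizes are arbitrary illustrations, not
# the actual Llama 4 configuration.
import numpy as np

def chunked_causal_mask(seq_len: int, chunk: int) -> np.ndarray:
    """Token i may attend to token j only if j <= i and both sit in the same chunk."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    return (k <= q) & (q // chunk == k // chunk)

def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Token i may attend to itself and the window - 1 tokens before it."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    return (k <= q) & (q - k < window)

chunked = chunked_causal_mask(seq_len=8, chunk=4)
sliding = sliding_window_causal_mask(seq_len=8, window=4)

# Token 4 is the first token of the second chunk: under chunked attention it sees
# only itself, while under a sliding window it still sees the three tokens before it.
print(chunked[4].astype(int))  # [0 0 0 0 1 0 0 0]
print(sliding[4].astype(int))  # [0 1 1 1 1 0 0 0]
```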

The choice of chunked attention for Behemoth, while seemingly efficient for memory, created these reasoning limitations. Compounding this issue, sources suggest that Meta may not have had sufficiently robust long-context evaluation or testing infrastructure in place to detect this specific failure mode early in the training process. This highlights a gap in their evaluation capabilities compared to labs that specialize in frontier models, a gap the new Superintelligence team is tasked with closing.
Expert Choice Routing
Many state-of-the-art LLMs employ a Mixture of Experts (MoE) architecture. In MoE models, instead of every token passing through every parameter in every layer, a 'router' mechanism directs tokens to a subset of specialized 'experts' (typically feed-forward networks) within each layer. This allows models to scale the number of parameters significantly while keeping the computational cost per token manageable.
There are different strategies for how the router selects experts. Most modern MoE models are trained using token choice routing. In this approach, for each token, the router determines the top K most relevant experts based on a learned scoring mechanism. The token's information is then processed by these K selected experts. This guarantees that every token is processed by the same number of experts (K), ensuring uniform information flow. However, a known challenge is that some experts can become disproportionately popular, leading to load imbalance and under-utilization of other experts. This can degrade training efficiency, especially when using Expert Parallelism (EP), where different experts reside on different GPU nodes, increasing reliance on the slower scale-out network (InfiniBand or RoCE) compared to the faster scale-up network (NVLink) within a server. Top labs mitigate this with load balancing techniques, often involving auxiliary loss functions.

Expert choice routing, a less common alternative introduced by Google in 2022, flips this dynamic. Instead of tokens choosing experts, experts choose tokens. The router determines, for each expert, the top N tokens it will process. This approach inherently promotes load balancing among experts, as each expert is guaranteed to process N tokens, regardless of how 'popular' those tokens are. This can lead to better training efficiency (MFU) across distributed hardware, as the workload is more evenly spread.
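The toy sketch below contrasts the two routing schemes using random router scores; the tensor shapes, top-K/top-N values, and scoring are generic illustrations rather than the actual Llama 4 router.

```python
# Toy contrast of token choice vs. expert choice routing for one MoE layer.
# Shapes, top-K/top-N values, and the random router scores are generic
# illustrations, not the actual Llama 4 router.
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts, K = 16, 4, 2
scores = rng.normal(size=(num_tokens, num_experts))  # router logits

# Token choice (top-K): every token is processed by exactly K experts,
# but per-expert load can be highly uneven.
token_choice = np.argsort(-scores, axis=1)[:, :K]                 # (num_tokens, K)
expert_load = np.bincount(token_choice.ravel(), minlength=num_experts)
print("token-choice load per expert:", expert_load)               # often skewed

# Expert choice (top-N): every expert processes exactly N tokens, so load is
# balanced by construction, but a token may be picked many times or not at all.
N = (num_tokens * K) // num_experts                               # match total compute
expert_choice = np.argsort(-scores, axis=0)[:N, :]                # (N, num_experts)
token_coverage = np.bincount(expert_choice.ravel(), minlength=num_tokens)
print("tokens picked by no expert:", int((token_coverage == 0).sum()))
```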

However, expert choice routing has a significant drawback, particularly during inference. Inference involves two main stages: Prefill (processing the input prompt) and Decode (generating the output token by token). During the Decode stage, the model processes one token at a time per layer. With expert choice routing, each expert can only select from this very small set of tokens (1 token x batch size per layer). This is vastly different from the training scenario where experts see a large pool of tokens (e.g., 8k sequence length x 16 batch size = 128k tokens). This mismatch between training and inference conditions, coupled with limitations in modern GPU networking that constrain batch sizes, makes inference with expert choice routing economically inefficient and technically challenging.
Sources indicate that the Llama 4 team initially used expert choice routing but switched to token choice routing partway through the training run. This mid-run change likely prevented the experts from specializing effectively under a consistent routing strategy, contributing to the model's suboptimal performance.
Data Quality: A Self-Inflicted Wound
Training frontier LLMs requires not only massive compute but also enormous quantities of high-quality data. Llama 3 405B was trained on 15 trillion tokens, and Llama 4 Behemoth likely required a significantly larger dataset, potentially 3-4 times that amount. Sourcing, cleaning, and preparing such vast datasets is a major bottleneck for AI labs, particularly in the West, where simply copying outputs from other models is not a viable strategy.
Meta had previously relied on publicly available datasets like Common Crawl for Llama 3. For Llama 4 Behemoth, they transitioned to using an internal web crawler they developed. While an internal crawler can potentially provide more control and access to unique data, this transition reportedly backfired during the Llama 4 training. The team struggled to effectively clean and deduplicate the massive new data stream generated by the crawler. Their data processing pipelines, not having been stress-tested at this unprecedented scale, proved inadequate.
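As a rough illustration of what even the simplest stage of such a pipeline involves, the sketch below performs exact deduplication by hashing normalized documents; production pipelines at web scale add near-duplicate detection (e.g., MinHash/LSH), language identification, and quality filtering, and nothing here reflects Meta's actual tooling.

```python
# Minimal sketch of exact deduplication over a crawled corpus by hashing
# normalized documents. Production pipelines at web scale add near-duplicate
# detection (MinHash/LSH), language ID, and quality filters; nothing here
# reflects Meta's actual tooling.
import hashlib
import re
from typing import Iterable, Iterator

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())

def dedupe(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document the first time its normalized content hash is seen."""
    seen: set[bytes] = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc

crawl = [
    "Attention is all you need.",
    "Attention   is all you NEED.",  # whitespace/case variant of the first page
    "Mixture-of-experts layers scale parameter count cheaply.",
]
print(list(dedupe(crawl)))  # the near-identical second document is dropped
```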
Furthermore, unlike many other leading AI labs, including OpenAI and DeepSeek, Meta reportedly does not utilize YouTube data in its primary training corpora. YouTube, with its vast repository of video transcripts, lectures, tutorials, and diverse content, represents an incredibly rich source of multimodal data. The absence of this data may have hampered Meta's efforts to build a truly multimodal model and could be a significant disadvantage in developing models with broad world knowledge and reasoning capabilities.
Scaling Experiments
Beyond specific technical choices, the Llama 4 team also faced challenges in effectively scaling research experiments into a full-fledged, multi-billion-dollar training run. Large-scale AI training is an incredibly complex undertaking that requires meticulous planning, rigorous experimentation, and strong coordination.
Sources suggest there were competing research directions within the Llama 4 team, and a lack of clear leadership to decisively choose the most promising path forward. Certain model architecture choices were reportedly included in the large training run without sufficient 'ablations' – smaller, controlled experiments designed to isolate the impact of a specific change. This led to poorly managed scaling ladders, where it was difficult to pinpoint which architectural or data choices were contributing positively or negatively to the model's performance.
The difficulty of scaling experiments is a common challenge in frontier AI research. A well-known anecdote from OpenAI's training of GPT-4.5 illustrates this point. During scaling experiments, the team observed promising improvements in the model's ability to generalize. However, they later discovered that parts of their internal code monorepo, used as a validation dataset, had been inadvertently copied directly from publicly available code. The model wasn't generalizing; it was simply regurgitating memorized code from its training data. This highlights the immense diligence and preparation required to execute large pretraining runs effectively and to ensure that observed improvements are genuine indicators of increased intelligence rather than artifacts of data contamination or methodological flaws.
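A basic n-gram overlap screen, sketched below with an arbitrary window size and threshold, is the kind of check that catches this failure mode; it is an illustration only, not any lab's actual procedure.

```python
# Toy version of a train/eval contamination check: flag validation documents
# whose n-grams overlap heavily with the training corpus. The 8-gram window and
# 20% threshold are arbitrary choices for this sketch.
def ngrams(text: str, n: int = 8) -> set:
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 0))}

def contamination_rate(eval_doc: str, train_ngrams: set) -> float:
    doc_ngrams = ngrams(eval_doc)
    return len(doc_ngrams & train_ngrams) / len(doc_ngrams) if doc_ngrams else 0.0

train_text = "def softmax ( x ) : e = exp ( x - x . max ( ) ) ; return e / e . sum ( )"
eval_doc = train_text + " # held-out slice of the monorepo"

rate = contamination_rate(eval_doc, ngrams(train_text))
if rate > 0.2:
    print(f"likely contaminated: {rate:.0%} of 8-grams also appear in training data")
```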
Despite these significant technical hurdles and strategic missteps with Llama 4 Behemoth, not all was lost. Meta was reportedly able to distill the knowledge (logits) from the larger, flawed model into smaller, more efficient pretrained models like Maverick and Scout. This distillation process allowed them to bypass some of the architectural flaws of the larger model and produce usable, albeit not best-in-class for their size, models. While distillation is efficient for creating smaller models, it is fundamentally limited by the capabilities of the source model and is not a substitute for successful large-scale reinforcement learning from human feedback (RLHF) or other post-training techniques that Meta is reportedly still developing.
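For reference, the sketch below shows the standard form of logit distillation, in which the student is trained to match the teacher's temperature-softened token distribution alongside the usual hard-label loss; the temperature and loss weighting here are illustrative choices, not Meta's actual recipe.

```python
# Minimal numpy sketch of logit distillation: the student matches the teacher's
# temperature-softened token distribution (soft targets) alongside the usual
# hard-label cross-entropy. Temperature and loss weighting are illustrative.
import numpy as np

def softmax(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) at temperature T + (1 - alpha) * CE."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    kl = (p_teacher * (np.log(p_teacher + 1e-12) - log_p_student)).sum(-1).mean() * T**2
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * kl + (1 - alpha) * ce

# Two token positions over a toy 4-token vocabulary:
teacher = np.array([[4.0, 1.0, 0.5, 0.1], [0.2, 3.5, 0.3, 0.1]])
student = np.array([[2.0, 1.5, 0.5, 0.2], [0.5, 2.0, 0.4, 0.3]])
labels = np.array([0, 1])  # ground-truth next tokens
print(round(float(distillation_loss(student, teacher, labels)), 3))
```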
Meta GenAI 2.0 – Part 2, Bridging the Talent Gap
With the radical infrastructure revamp underway and the technical lessons from the Llama 4 experience deeply absorbed, Meta's GenAI 2.0 strategy pivots decisively to address the second critical ingredient for achieving superintelligence: talent.
Mark Zuckerberg has publicly acknowledged the talent gap between Meta and leading AI labs and has personally taken charge of recruitment efforts for the new Superintelligence initiative. His mission is to assemble a small, elite team characterized by extreme talent density. This involves not just competitive offers but truly unprecedented compensation packages aimed at luring the world's top AI researchers and engineers. As mentioned earlier, typical offers for key individuals are reportedly in the $200 million to $300 million range over four years, a level of compensation designed to make candidates think twice before accepting offers elsewhere.
The strategic goal is to create a 'flywheel effect': attracting a critical mass of top-tier researchers lends immediate credibility to the project, which in turn attracts more talent, creating a virtuous cycle of innovation and momentum. This strategy is already yielding results, with several high-profile figures joining Meta's AI efforts, including Nat Friedman (former GitHub CEO), Alexandr Wang (former Scale AI CEO), and Daniel Gross (co-founder of SSI, Ilya Sutskever's startup). These hires bring not only exceptional technical and operational expertise but also significant influence and respect within the AI community.
The recruiting pitch Meta can offer is compelling: access to unrivaled computational resources (with the ramp-up of Prometheus and Hyperion), the opportunity to contribute to building potentially the best open-source model family in the world, and immediate access to Meta's ecosystem of more than 3 billion daily active users for deployment and feedback. Combined with the extraordinary compensation packages, this makes Meta a formidable competitor in the global AI talent war, successfully attracting talent from rivals like OpenAI, Anthropic, and numerous other leading firms.
M&A, Scale AI, etc
Meta's aggressive pursuit of talent and technology has also extended to potential acquisitions. Reports indicated that Zuckerberg made acquisition offers to prominent AI startups like Thinking Machines and SSI (founded by former OpenAI Chief Scientist Ilya Sutskever), though these offers were reportedly declined. While some observers suggested that Meta 'settled' by acquiring a significant stake in Scale AI after being turned down by others, this perspective likely underestimates the strategic value of the Scale AI deal.
As detailed in the analysis of the Llama 4 failure, data quality and evaluation capabilities were significant weaknesses for Meta. The acquisition of a large stake in Scale AI directly addresses these shortcomings. Scale AI is a leader in data annotation, curation, and model evaluation. Alexandr Wang, Scale AI's former CEO, brings deep expertise in these critical areas, along with key engineering talent, particularly from Scale's SEAL lab.
The SEAL lab is renowned for developing rigorous model evaluation benchmarks, including Humanity's Last Exam (HLE), considered one of the most demanding tests of reasoning ability in AI models. Integrating Scale AI's technology, data expertise, and evaluation methodologies is a direct and powerful move to strengthen Meta's AI pipeline and address the data quality and evaluation gaps that contributed to previous model challenges. The addition of figures like Nat Friedman and Daniel Gross further bolsters Meta's AI leadership, bringing elite operational experience and deep investment insight from the AI ecosystem.
The More You Buy The More You Save: OBBB Edition
Meta's timing for this massive investment and expansion couldn't be better from a financial perspective, particularly regarding infrastructure. The recently passed 'One Big Beautiful Bill' (OBBB) includes tax provisions, most notably full expensing of capital equipment and domestic R&D, that substantially improve the economics of large capital projects such as datacenters and advanced computing facilities. These incentives can significantly accelerate the return on investment for large-scale capital expenditures.
For hyperscalers like Meta, these provisions offer substantial tax benefits for building new datacenters and deploying advanced AI compute infrastructure now. This effectively means that a portion of the massive multi-billion-dollar investments in projects like Prometheus and Hyperion will be offset by government incentives, making the aggressive buildout even more financially attractive. This alignment of corporate strategy with government policy creates a powerful tailwind for Meta's superintelligence ambitions, effectively providing a form of government support for the modern-day equivalent of a 'Manhattan Project' in AI development.
Meta's transformation is a compelling narrative of a tech giant leveraging its immense resources and founder-led urgency to overcome challenges and compete at the very forefront of AI. By radically rethinking its approach to compute infrastructure, aggressively pursuing top talent, and strategically acquiring key capabilities like those offered by Scale AI, Meta is positioning itself to be a dominant force in the race for superintelligence. The lessons learned from past challenges, particularly the technical hurdles faced with Llama 4, are clearly informing this new, high-stakes strategy. The coming years will reveal whether this unprecedented investment and strategic pivot will enable Meta to achieve its ambitious goals and reshape the future of AI.