The Future of AI: It’s About Architecture

Image: a humanoid robot alongside a delivery drone carrying a package above two autonomous delivery robots.
Image by manus.im

by Shomit Ghose

Scaling: Thinking Globally and Acting Locally

For much of the past decade, AI progress has been defined by scale: larger models trained on ever greater amounts of data and compute. The paradigm of “more parameters, better performance” has driven notable advances but is now showing signs of saturation. Empirical scaling laws, such as those identified in Chinchilla and follow-on studies, demonstrate diminishing marginal returns as model and dataset sizes increase. The cost curve has turned superlinear: each marginal gain in performance demands a disproportionately larger outlay of compute and data, while qualitative improvements such as coherence, persistence of memory, and multi-step reasoning remain unrealized.
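To make the diminishing returns concrete, the Chinchilla analysis fit training loss to a simple parametric form. The sketch below uses the exponents reported by Hoffmann et al. 2022; the 6·N·D cost estimate is the usual rule of thumb rather than an exact figure:

```latex
% Chinchilla-style parametric loss: N = parameters, D = training tokens,
% E = irreducible loss. With exponents this small, doubling N or D shaves
% an ever-thinner sliver off the loss, while training cost grows roughly
% as 6*N*D FLOPs.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad \alpha \approx 0.34, \;\; \beta \approx 0.28
```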

Figure: “Running Out of Data” – projected growth of LLM training datasets against the stock of available text on the Internet, with the two curves intersecting around 2028; models such as GPT-3, PaLM, FLAN, Falcon 180B, DBRX, and Llama 3 are plotted along the training-data curve.
From Jones 2024 (https://www.nature.com/articles/d41586-024-03990-2)

From Scaling to New Architectures

The next phase of AI will not be determined by who commands the largest GPU clusters, but by who can design smarter systems. As raw scaling reaches practical and economic limits, progress depends increasingly on software architecture: how components like models, memory, logic, and context are integrated into coherent and reliable systems. This transition mirrors the shift in CPU design after Dennard scaling collapsed in the mid-2000s, with performance gains coming not from faster single cores but from multicore architectures and better system design.

World Models: Scaling in Disguise or Architecture’s Return?

AI “world models” (WMs) are central to this transition. Properly defined, a world model is the internal, structured representation an AI maintains to simulate and predict its environment, whether physical or abstract. It functions as a latent simulator or explicit knowledge store, enabling the AI agent to perform prediction, planning, and abstract reasoning without real-world interaction. WMs capture the core dynamics and logic necessary for deliberate action, both in near-term embodied AI applications and eventually within AGI. The crucial question now is whether these world model AI systems will become another form of scaling – larger, more comprehensive, and compute-intensive – or whether they will drive genuine architectural innovation, emphasizing modularity, specialization, and hybrid reasoning.
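To make the “latent simulator” framing concrete, below is a minimal sketch of a world-model planning loop. Every name in it (LatentWorldModel, encode, predict, reward) is illustrative, not any particular system’s API:

```python
# Hypothetical latent world model: encode an observation into a latent state,
# predict how that state evolves under candidate actions, and choose the
# action sequence whose imagined rollout scores best -- all without touching
# the real environment.

class LatentWorldModel:
    def __init__(self, encode, predict, reward):
        self.encode = encode    # observation -> latent state z
        self.predict = predict  # (z, action) -> next latent state
        self.reward = reward    # z -> scalar estimate of how good z is

    def plan(self, observation, candidate_plans, horizon=5):
        """Score each candidate action sequence by simulated rollout."""
        z0 = self.encode(observation)
        best_plan, best_score = None, float("-inf")
        for plan in candidate_plans:
            z, score = z0, 0.0
            for action in plan[:horizon]:
                z = self.predict(z, action)  # imagine, don't act
                score += self.reward(z)
            if score > best_score:
                best_plan, best_score = plan, score
        return best_plan
```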

Scaling laws remind us to be cautious: simply giving AI systems larger memories or more detailed internal representations might not solve the problems at hand. This shift may not reduce overall computational cost, but merely redistribute it from training to inference. However, by organizing intelligence into modular world models (“continental” models, anyone?) that learn and reason in specialized ways, we may gain real advantages. Such designs can make AI systems learn from fewer examples, behave in ways we can better understand, and be reused across tasks, changing how progress scales, rather than just extending the old pattern.
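And a caricature of the modular alternative – a router dispatching to specialist world models, again with purely hypothetical names:

```python
# Hypothetical "continental" modularity: instead of one monolithic world
# model, a router sends each situation to a specialist trained on a narrower
# slice of reality. Specialists can be smaller, more sample-efficient, and
# swapped out independently.

class ModularWorldModel:
    def __init__(self, specialists, route):
        self.specialists = specialists  # e.g. {"kitchen": wm1, "warehouse": wm2}
        self.route = route              # observation -> specialist key

    def predict(self, observation, action):
        specialist = self.specialists[self.route(observation)]
        return specialist.predict(observation, action)
```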

Figure: overview of an embodied AI agent architecture – perception inputs (audio, visual, gestures, user queries), memory, and previous states feed reasoning and planning within a world model, which drives visual/verbal output and physical robot control in a feedback loop with the user and environment.
From Fung et al. 2025

From Omnipotent (AGI) AI to Omnipresent (Embodied) AI

Embodied Artificial Intelligence (EAI) unites perception, reasoning, and action within physical systems: think robots. While embodiment may provide useful priors and training signals for certain cognitive capabilities, whether it’s necessary or sufficient for AGI remains an open question. EAI does give us a practical waypoint, both from a technological standpoint and an economic one, on the long road to AGI; it marks a transition from purely digital cognition to intelligence grounded in real-world interaction. Recent progress in large language models (LLMs), multimodal LLMs (MLLMs), and world models has energized this research frontier.

LLMs strengthen embodied AI by giving machines the ability to understand meaning and to break complex instructions into manageable steps. Systems such as SayCan show how an AI can take high-level commands expressed in natural language and turn them into practical sequences of actions, all while checking that each step makes sense in context and is physically possible. These systems illustrate how language models can translate human intent into purposeful, goal-directed behavior. But because they rely on predefined environments and fixed skill vocabularies, their flexibility in adapting to new tasks or devices remains limited.
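The selection rule SayCan popularized is easy to sketch: the language model scores how useful each predefined skill would be for the instruction (“say”), a learned affordance model scores how likely the skill is to succeed from the current state (“can”), and the robot executes the highest-scoring skill. The function names below are stand-ins, not SayCan’s actual interface:

```python
import math

def pick_skill(instruction, state, skills, llm_logprob, affordance):
    """Choose the skill maximizing usefulness ("say") times feasibility ("can")."""
    def combined(skill):
        usefulness = math.exp(llm_logprob(instruction, skill))  # LLM's task relevance
        feasibility = affordance(state, skill)                  # learned success estimate
        return usefulness * feasibility
    return max(skills, key=combined)
```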

Recent progress in multimodal LLMs broadens this capability. By integrating information from vision, sound, and touch, these models can interpret their surroundings, recognize spatial relationships, and adjust plans as conditions change. This allows more natural, end-to-end interaction with the real world. However, MLLMs still struggle to align their reasoning with the laws of physics or to adapt reliably in real time when environments shift unexpectedly.

World models redress some of these shortcomings by building internal representations of the world and predicting what might happen next. By encoding basic physical principles such as gravity, motion, and cause-and-effect, these models are meant to allow AI agents to simulate outcomes, anticipate risks, and plan safer, more efficient actions. What WMs lack, in turn, is the broad semantic understanding and flexible reasoning that language models provide – which is what makes the two natural complements.

AI’s most promising path forward may lie in combining multimodal large language models with WMs. MLLMs excel at understanding context and intent (knowing what to do) while world models capture the dynamics of the physical world, enabling systems to reason about how to do it. Unfortunately, integrating the two remains a profound technical challenge. Their representations differ: MLLMs think in semantic embeddings like “pick up the lavender cup,” while world models operate in continuous state spaces of positions and forces. Aligning meaning with physics remains unsolved. Their sense of time also diverges. LLMs move through tokens, and simulators through millisecond timesteps, making synchronization difficult. The technologies also handle uncertainty differently: MLLMs reason probabilistically about words, while world models track physical noise and randomness. Combining these forms of uncertainty will require new (and still distant) probabilistic frameworks that link symbolic reasoning to grounded simulation.

Future progress will necessarily depend on architectural clarity. Possible directions include hierarchical systems where MLLMs set goals and WMs handle execution, end-to-end learnable pipelines that let the two learn together, or agentic architectures with explicit communication between modules. Turning this vision into reality will mean building not just smarter models, but better connections between understanding and action.
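As one illustration of the hierarchical direction, the sketch below lets an MLLM propose a semantic subgoal, a grounding module translate it into the world model’s state space, and the world model plan execution. The grounding callable is a placeholder for exactly the unsolved seam described above; none of these names correspond to a real system:

```python
class HierarchicalAgent:
    """MLLM decides what to do; the world model decides how to do it."""

    def __init__(self, mllm, ground, world_model):
        self.mllm = mllm                # (instruction, obs) -> semantic subgoal text
        self.ground = ground            # (subgoal, obs) -> target physical state
        self.world_model = world_model  # plans low-level actions toward a state

    def act(self, instruction, observation):
        subgoal = self.mllm(instruction, observation)     # e.g. "pick up the cup"
        target_state = self.ground(subgoal, observation)  # meaning -> physics
        return self.world_model.plan_toward(target_state, observation)
```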

Early frameworks such as EvoAgent demonstrate the potential, merging multimodal reasoning with predictive modeling to create systems capable of self-planning, reflection, and autonomous operation across diverse settings. Developing joint MLLM-WM architectures will likely define the next generation of embodied systems. By combining semantic intelligence with grounded physical reasoning, these designs promise to bridge the gap between narrow, task-specific AI and truly general physical intelligence.

Thinking Like Software Engineers – Or Mother Nature?

Achieving robust, deployable AI now appears less a problem of large-scale model training than one of systems engineering. The emerging architectural paradigm emphasizes:

  • Practical problem sets: we do not have the luxury of having AI be a science project in pursuit of some abstract Holy Grail. It can only bring impact (and importantly, investment) if we focus on problems that are bounded by true human and economic needs.
  • Multimodal semantics: MLLMs that help AI systems understand the semantics of “what” by integrating information from text, vision, and other sensory inputs to recognize and interpret objects, entities, and their meanings within context.
  • Full-context engineering: LLMs today know only what fits inside the current context window – large, but bounded and ephemeral. We humans operate on context that spans decades. We need infrastructure that can pull relevant information when needed, track ongoing state, and weigh conflicting data intelligently.
  • True memory: current LLMs jury-rig memory by overloading conversation history into prompts, which is neither scalable nor sustainable. We need AI with structured, multi-timescale memory. Human memory integrates working, episodic, semantic, and procedural subsystems with distinct update and consolidation rules. Building this in AI requires explicit architectures that define how information is stored, updated, and stabilized without catastrophic forgetting (a minimal sketch follows this list).
  • Hybrid computation: systems combining deterministic logic for precise operations with probabilistic inference for uncertain contexts. LLMs excel at language but are unpredictable and perform poorly on logical or symbolic tasks. Integrating them within deterministic frameworks allows flexibility without sacrificing reliability. Emerging neuro-symbolic and reasoning-augmented LLMs are beginning to address these weaknesses.
  • Teams of specialists: coordinated, specialized subsystems handling perception, learning from experience, reasoning, and control are tractable challenges compared with building a single monolithic AI model. Different models for distinct tasks – vision, language, math, planning, and grounding in physical realities – can work together in structured workflows; orchestrating specialized components offers a more effective approach. (See ALOHA, Voyager, and generative diffusion model-based planners.)
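Here is the promised memory sketch: a minimal multi-timescale structure loosely mirroring the working/episodic/semantic split above. The consolidation rule is deliberately naive – defining what gets promoted and what decays is the real architectural problem:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    # Working memory: small, fast, constantly overwritten.
    working: deque = field(default_factory=lambda: deque(maxlen=32))
    # Episodic memory: append-only, time-ordered log of events.
    episodic: list = field(default_factory=list)
    # Semantic memory: stable facts distilled from repeated experience.
    semantic: dict = field(default_factory=dict)

    def observe(self, event: dict):
        self.working.append(event)
        self.episodic.append(event)

    def consolidate(self):
        # Naive placeholder rule: promote recurring observations into
        # semantic facts so they need not be re-derived from raw history.
        for event in self.episodic:
            fact = event.get("fact")
            if fact:
                self.semantic[fact] = self.semantic.get(fact, 0) + 1
```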
Figure: prompt engineering versus context engineering – on the left, a single-turn system prompt plus user message producing an assistant message; on the right, an agent’s context window iteratively curated from documents, tools, memory files, and messages, refined through tool calls and results.
From Anthropic 2025

Darwin’s Adaptive Radiation & AI’s Algorithmic Evolution

The trajectory of advanced AI is converging with the principles of biological evolution: a move from monolithic systems toward specialized, modular architectures. This echoes nature’s successful strategy of selection pressure driving functional specialization, an imperative driven by change and adaptive constraints, not architectural foresight. Biological intelligence is a mosaic of localized adaptations (e.g., visual cortex, cerebellum), where emergent cognition arises from the distributed interaction of specialized subsystems, not a unified “super-cell” organ of thought.

Modern AI is undergoing an analogous algorithmic evolution. By adopting established principles of separation of concerns and modularity, AI gains scalability and resilience. The central precept is that functional differentiation is the engine of adaptability. The future of general intelligence depends not on creating a single, all-purpose model, but on orchestrating ecosystems of narrowly optimized components. This mirrors nature’s billion-year evolutionary truth: specialization, coordination, and modular structure are the true enablers of robust intelligence.

Future AI systems, much like evolved biological intelligences, will require a hybrid architecture balancing modularity and deep integration. Modularity, essential for engineering and scaling, enables independent development cycles and clean interfaces, as seen in traditional AI pipelines where an independent vision model hands off discrete object detections to a symbolic planner. While this offers separable components, the brittle handoffs necessarily limit robustness. Conversely, systems requiring complex dependencies, such as the blending of MLLMs with WMs, demand deep integration, requiring shared representations and joint training, resulting in a robust but complex (and potentially opaque) end-to-end architecture, from pixels to actions. The challenge will lie in designing a system where core foundation models benefit from the representational power and robustness of deep integration, while maintaining outer layers with clean, minimally integrated modular interfaces to enable easy substitution, adaptation, and system-level reasoning. This duality is necessary to achieve both the flexibility of engineering and the coherence of emergent intelligence.

Bigger Isn’t Always Better

Henceforward, AI’s progress should be assessed not by model size but along multiple axes: sample efficiency (performance per training token), compute efficiency (performance per FLOP), robustness under distribution shift, interpretability, maintainability, and safety. Improvements across these metrics signal architectural progress; gains confined to scale will only signal more of the same.
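None of these axes has a standardized definition yet; the ratios below are one simple, illustrative way to operationalize them rather than an established benchmark:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    score: float            # task performance, e.g. benchmark accuracy in [0, 1]
    ood_score: float        # same benchmark under distribution shift
    training_tokens: float
    training_flops: float

    def sample_efficiency(self) -> float:
        return self.score / self.training_tokens  # performance per training token

    def compute_efficiency(self) -> float:
        return self.score / self.training_flops   # performance per FLOP

    def robustness_gap(self) -> float:
        return self.score - self.ood_score        # smaller gap = more robust
```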

The New Architecture of Artificial Minds

The fundamental challenge in designing truly intelligent systems is not merely one of scale or speed, but of system fragility – an issue that moves beyond legacy concerns of redundancy and error correction. We must architecturally confront the fact that probabilistic AI, unlike traditional software, suffers from an endless accretion of uncertainty, which, like misinformation, can ramify uncontrollably through every stage of computation. Furthermore, when complex models are combined, they produce emergent behaviors that are neither intended nor predictable, creating profound ethical and safety risks that defy standard detection and debugging. Finally, the necessary capacity for real-world learning and adaptation – the ability to evolve in response to changing data and adversarial attack – stands in direct tension with AI’s goal of deterministic reliability. The next generation of AI must therefore adopt a new design philosophy, creating systems that are explicitly engineered for statistical integrity, behavioral guardrailing, and controlled, safe evolution.
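The accretion is easy to quantify under a simplifying (and admittedly idealized) independence assumption:

```latex
% If each of k pipeline stages independently succeeds with probability p_i,
% end-to-end reliability is the product of the stages -- e.g. ten stages
% that are each 95% reliable (an illustrative figure) leave only ~60%:
P_{\text{pipeline}} = \prod_{i=1}^{k} p_i,
\qquad 0.95^{10} \approx 0.60
```

Real components are correlated and can partially check one another, but the multiplicative intuition is why guardrails must live in the architecture rather than in any single model.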

Architecture in Practice: China’s Model

Industrial AI deployments already reflect this architectural turn, and we’ve seen Chinese firms pursuing embodied AI due to an alignment of capabilities and market opportunities. Domain-specific systems in manufacturing, logistics, and medicine, particularly in China’s applied-AI ecosystem, integrate perception models, symbolic reasoning, and memory systems to solve bounded problems efficiently. Similarly, DeepMind’s AlphaFold demonstrated the power of targeted, engineered solutions over generalized intelligence. These examples suggest that practical impact comes from purposeful design, not from modeling “the entire world.” We might also conclude that full AGI is not needed to deliver positive human or economic benefit. Purpose-built AI appears to do just fine.

Figure: data-driven (bots) versus meaning-driven (humans) approaches to intelligence, with accuracy plotted against complexity; a diagonal “neuro-symbolic boundary” separates the data-driven and rules-driven regions, an arrow notes that scaling data-driven methods has limits, and labels along the bottom run from generative chat and context engineering to agentic AI and “plain old computer science.”

Architectural Innovation or Scaling Redux?

In the end, whether building artificial general intelligence or embodied practical intelligence may fundamentally be a systems engineering challenge, requiring deep experience in designing reliable, distributed systems. Breakthroughs will come from computer science practitioners who understand context management, memory systems, workflow orchestration, and coordination of specialized models at scale. The implications are significant. As diminishing returns limit the benefits of ever-larger models, we may see high-performance AI becoming widely accessible, democratizing capabilities rather than concentrating them among those with the largest compute budgets. Here, world models will represent a design space rather than a prescribed path: they can either perpetuate the scaling paradigm, favoring comprehensive, compute-intensive representations, or enshrine software architecture as the primary driver of progress. Success will depend on whether the AI community emphasizes deployability, efficiency, and modular design over sheer scale. The future will favor those who build the most intelligent systems, not the largest.

The critical question now becomes not, “What is the next model breakthrough?” but, “How can we design system architectures that make practical AI achievable with existing models?” The answer seems clear: the future of AI is architectural, not just statistical. Teams that understand how to integrate reliable, modular components capable of reasoning across domains, while maintaining consistent behavior, will lead. The models we have today are sufficient. The missing ingredient? Perhaps systems engineering to transform them into practical intelligence.