
Building an AI-native data architecture

  • Writer: Etienne Oosthuysen
  • 2 days ago
  • 8 min read

From Facts to Features: The Heart of AI-Native Architecture

AI is no longer an afterthought in data architectures. It’s the reason we’re redesigning everything—from ingestion pipelines to transactional systems. The question is no longer whether your architecture supports AI, but whether it was built for it. This may sound extreme, but it’s really not. Most innovative businesses out there are rethinking how they do things, some even working towards whole business units that are fully agent-operated.

Data architectures within business should be no different. A few years ago, building a modern data platform meant stitching together the right components to support reporting and analytics. Data lakes handled storage, transformation pipelines did their thing on a schedule (even if near real-time), and reports and dashboards served up answers—albeit delayed and often narrowly scoped. Artificial intelligence, if it was on the radar at all, lived on the fringe of these architectures: in notebooks, in silos, and often in someone else’s backlog, contributing little to the data assets and data products produced by those transformation pipelines.

This is no longer the case. The rise of large language models and real-time inference means that the old thinking around data platforms (and when I say old, I literally mean until very recently) has become a constraint; it has exposed the limits of traditional data stacks. AI is no longer something we integrate later—it’s what the architecture needs to serve. FROM. THE. START.

Welcome to the era of the AI-native data architecture. But what is it?


AI-native data architecture – what is it?

Essentially, in an AI-native data architecture, everything is more connected. Instead of having separate systems for OLTP, analytics, and machine learning, it’s all part of the same platform. The same data used to build customer reports also feeds feature stores, which provide consistent inputs for training models and making real-time predictions.

I like food, let’s use that as an analogy.

It’s a bit like upgrading from a kitchen that cooks one meal at a time to a fully equipped prep kitchen that can handle anything the restaurant needs—all at once. In the past, each team might have ordered up their own "meal": a tailored dashboard here, a one-off dataset for a model there. These were packaged as data products—built for a single purpose, often reworked from scratch, and, if done correctly, highly reused.

But in an AI-native architecture, data isn’t treated like a series of pre-made meals. It’s treated like high-quality ingredients—washed, chopped, labelled, and always ready. That’s the role of the feature store. It’s where reusable pieces of data—like customer age, product interactions, or credit risk flags—are prepared and stored, so they can be quickly added to any dish: a model, a dashboard, or even a real-time chatbot. No more starting from scratch.

The benefit? When a new request comes in—whether it’s to train a new model or enable a smart assistant—the system doesn’t grind to a halt. It assembles what’s needed using trusted, consistent building blocks. These components are versioned and tagged so you know exactly what’s being used, and where. Data becomes modular, governed, and ready for reuse. This builds confidence in AI and Analytics at scale.
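To make this concrete, here is a minimal, hypothetical sketch of what versioned, tagged building blocks could look like. The class names and fields below are illustrative assumptions, not the API of any particular feature store product.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: a toy in-memory "feature store" showing how
# reusable, versioned, tagged features might be registered and reused.
@dataclass
class FeatureDefinition:
    name: str                      # e.g. "days_since_last_purchase"
    version: str                   # e.g. "v1"
    source: str                    # upstream silver table the feature is derived from
    tags: list = field(default_factory=list)

class FeatureRegistry:
    def __init__(self):
        self._features = {}

    def register(self, feature: FeatureDefinition):
        # Key by (name, version) so consumers can pin an exact version.
        self._features[(feature.name, feature.version)] = feature

    def get(self, name: str, version: str) -> FeatureDefinition:
        return self._features[(name, version)]

registry = FeatureRegistry()
registry.register(FeatureDefinition(
    name="days_since_last_purchase",
    version="v1",
    source="silver.customer_transactions",
    tags=["churn_model", "customer_dashboard"],
))

# Both a training pipeline and a dashboard can now pin the same definition.
feature = registry.get("days_since_last_purchase", "v1")
print(feature.source, feature.tags)
```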

In this model, AI becomes something you build with, not bolt on.


Hang on, this smells like the concept of authoritative data, which is not new, just repackaged

The idea of creating clean, trusted, reusable datasets is hardly new. It’s at the very core of the lakehouse model—with bronze, silver, and gold layers designed to progressively refine data into business-ready assets. In fact, the silver layer has always aimed to produce authoritative datasets that can be used across dashboards, reports, and even machine learning pipelines. So yes—it’s fair to say that reusability and trust were already part of the modern data playbook.

But here’s the shift.

In an AI-native architecture, we’re not just curating data for downstream use—we’re engineering it specifically to serve intelligent systems. The goal isn’t just trusted analytics anymore. It’s precision, speed, and adaptability for models and agents that operate in real time. It’s about shaping data into features, not just facts—packaged and served in a way that supports both training and live inference.


Facts

Facts are clean and structured data points that describe what happened. These are the kinds of values you’d see in a report or dashboard: how much a customer spent last month, how many products were sold yesterday, or the current status of an account. They’re essential for understanding the business, but they’re not enough to drive intelligent systems on their own. 

[Image: Dataset typical in analytics workloads]

In the dataset above we can report on total revenue, the number of recent transactions, or sales by category.


Features

Features, on the other hand, are the signals within the data, engineered to be reusable, consistent, and immediately usable across models, agents, and real-time applications, as well as analytics.

[Image: Features as reusable signals]

The features above are aggregated (Avg Spend), engineered (Days Since Last Purchase), encoded (Preferred Category), or derived using thresholds or logic (High Value Customer).

Looking at the feature version of the same data, you can start to see the shift. Rather than just showing what happened, we’re capturing meaningful signals—like how engaged a customer is, whether they meet a high-value threshold, or what their likely preferences are. These aren’t just helpful for reports—they’re machine-readable, version-controlled, and designed for repeated use across models, apps, and decision-making systems.
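Because the two tables are shown as images, here is a small illustrative pandas sketch of the same shift: starting from transaction-level facts and deriving the kinds of features listed above. The sample data, column names, and the 500 average-spend threshold for "high value" are assumptions made purely for illustration.

```python
import pandas as pd

# Illustrative facts: transaction-level records, the kind you would see in a report.
facts = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C2", "C2"],
    "txn_date": pd.to_datetime(["2025-05-01", "2025-06-10", "2025-04-20", "2025-06-01", "2025-06-15"]),
    "category": ["Grocery", "Electronics", "Grocery", "Grocery", "Apparel"],
    "amount": [120.0, 950.0, 60.0, 80.0, 40.0],
})

as_of = pd.Timestamp("2025-06-30")

# Features: reusable signals engineered from the facts.
features = (
    facts.groupby("customer_id")
    .agg(
        avg_spend=("amount", "mean"),                            # aggregated
        last_purchase=("txn_date", "max"),
        preferred_category=("category", lambda s: s.mode()[0]),  # encoded as most frequent category (assumption)
    )
    .reset_index()
)
features["days_since_last_purchase"] = (as_of - features["last_purchase"]).dt.days  # engineered
features["high_value_customer"] = features["avg_spend"] > 500                       # derived via an assumed threshold
features = features.drop(columns="last_purchase")

print(features)
```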

And this shift doesn’t stop with how the data is shaped—it changes how it flows through the architecture. In a typical lakehouse setup, those silver or gold datasets are usually processed in batches—great for analytics, and even for training ML models offline. But AI-native systems operate with very different expectations. They need these features available on demand, sometimes within milliseconds, to power conversational agents, personalised recommendations, or real-time fraud detection.
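As a rough sketch of what "available on demand" can look like, the snippet below contrasts batch-computed features with a key-value style online lookup. The plain dictionary stands in for whatever low-latency store a real implementation would use (Redis, DynamoDB, a managed online feature store), and the function names are assumptions.

```python
# Sketch only: batch-computed features are published to a low-latency online
# store keyed by entity ID, so inference can fetch them in milliseconds.
online_store = {}  # a real implementation would use Redis, DynamoDB, or similar

def publish_features(batch_features: dict) -> None:
    """Push the latest batch-computed feature values, keyed by customer_id."""
    online_store.update(batch_features)

def get_online_features(customer_id: str, feature_names: list) -> dict:
    """Low-latency lookup used at inference time (fraud check, chatbot, recommender)."""
    row = online_store.get(customer_id, {})
    return {name: row.get(name) for name in feature_names}

# Output of a batch feature pipeline (illustrative values).
publish_features({
    "C1": {"avg_spend": 535.0, "days_since_last_purchase": 20, "high_value_customer": True},
})

# At request time, the model or agent asks only for what it needs.
print(get_online_features("C1", ["avg_spend", "high_value_customer"]))
```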

It also changes how we govern. It’s not enough to know that a dataset is certified—you need to know that each feature is accurate, explainable, and being used appropriately. You need lineage that tracks how a customer’s risk score was generated, which prompt powered which decision, and whether a model used the correct version of a feature during inference.
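To illustrate the kind of lineage being described, here is a hypothetical sketch of an audit record captured at inference time, tying a decision back to the model and feature versions that produced it. The field names are assumptions, not the schema of any specific governance tool.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Dict, Optional
import json

# Hypothetical lineage record: which model used which feature versions
# (and, for generative flows, which prompt) to produce a given decision.
@dataclass
class InferenceLineage:
    model_name: str
    model_version: str
    feature_versions: Dict[str, str]   # feature name -> version used at inference
    prompt_id: Optional[str]           # populated for LLM/agent decisions
    decision_id: str
    timestamp: str

record = InferenceLineage(
    model_name="credit_risk_scorer",
    model_version="3.2.0",
    feature_versions={"days_since_last_purchase": "v1", "avg_spend": "v2"},
    prompt_id=None,
    decision_id="risk-score-000123",
    timestamp=datetime.now(timezone.utc).isoformat(),
)

# Persisted alongside the decision itself, this lets you answer later which
# feature versions, which model, and (if applicable) which prompt were used.
print(json.dumps(asdict(record), indent=2))
```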

So no, this isn’t just another iteration of the same idea.


Does bronze, silver and gold still apply?

Yes, bronze, silver and gold (BSG) still applies, but in an evolved role within an AI-native data architecture. It remains foundational, but it’s not sufficient on its own.

The medallion architecture is still:


  • A sound framework for data refinement,

  • A way to build trust and structure into raw data pipelines,

  • And a critical foundation for authoritative, reusable data assets.


So no, AI-native architectures don’t abandon BSG — they extend it. In an AI-native architecture, the scope of each layer broadens:


  • Bronze still ingests raw data, but now needs to support real-time AI workloads, not just batch.

  • Silver is no longer just for cleaning and joining—it’s a launchpad for feature engineering and streaming model input. It also now has a branch: the feature store, where selected, engineered features are versioned, tagged, and optimised for reuse across training pipelines, real-time inference, and intelligent agents.

  • Gold is not the end—it may include model outputs, semantic-ready data for agents, or act as a jumping-off point for prompt orchestration.
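As a rough PySpark sketch of this extended flow (assuming a Delta-enabled environment such as Databricks; the paths, table names, columns, and feature logic are simplified assumptions):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: land raw events (a batch read here; a streaming ingest would use spark.readStream).
bronze = spark.read.json("/landing/transactions/")
bronze.write.format("delta").mode("append").saveAsTable("bronze.transactions")

# Silver: clean and conform.
silver = (
    spark.table("bronze.transactions")
    .dropDuplicates(["transaction_id"])
    .filter(F.col("amount").isNotNull())
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver.transactions")

# Feature branch off silver: engineered, reusable model inputs.
features = silver.groupBy("customer_id").agg(
    F.avg("amount").alias("avg_spend"),
    F.datediff(F.current_date(), F.max("txn_date")).alias("days_since_last_purchase"),
)
features.write.format("delta").mode("overwrite").saveAsTable("features.customer_behaviour_v1")

# Gold: business-ready summaries for reports, board packs, or agent context.
gold = silver.groupBy("category").agg(F.sum("amount").alias("total_revenue"))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.revenue_by_category")
```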


BSG is the reliable kitchen—but AI-native platforms are now serving a much bigger menu.

Authoritative datasets are still the foundation—but AI-native design is what elevates that foundation into intelligence infrastructure. It’s what brings together feature stores, real-time pipelines, model observability, and prompt orchestration into something the traditional data platform never aimed to do.

It’s not just about having clean data anymore. It’s about making that data instantly useful, context-aware, and action-ready—because in an AI-native world, latency isn’t just technical, it’s competitive.

And that’s what sets this new architecture apart.


Does this mean analytics now moves to the periphery, like AI used to be?

No. Analytics doesn’t move to the periphery. Instead, it moves into a shared centre, alongside AI.

Rather than building separate pipelines for Analytics and AI, we now recognise that Silver and Gold data assets can serve both. A cleaned, enriched dataset may feed a Power BI dashboard and a machine learning model. A curated Gold asset might appear in a quarterly board pack and be used as context by a generative AI agent. What differs is how these assets are structured, optimised, and governed—not where they live.

A sales dataset might be transformed into:


  • An analytics-ready table summarising revenue per region and per month, and

  • A feature set for a churn model, including rolling spend, basket size variance, and days since last purchase.


These aren’t competing outputs. In fact, they often coexist in the same Silver workspace — just with different naming conventions, pipelines, and optimisation goals.
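A minimal pandas sketch of those two outputs coming from the same cleaned sales table (the column names and the simplified "rolling spend" calculation are assumptions for illustration):

```python
import pandas as pd

# One cleaned (silver) sales table, with assumed columns.
sales = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C2"],
    "region": ["NSW", "NSW", "VIC", "VIC"],
    "order_date": pd.to_datetime(["2025-04-03", "2025-05-21", "2025-05-02", "2025-06-11"]),
    "basket_size": [3, 5, 2, 7],
    "amount": [180.0, 320.0, 90.0, 410.0],
})

# Output 1: analytics-ready table summarising revenue per region and per month.
analytics_table = (
    sales.assign(month=sales["order_date"].dt.to_period("M"))
    .groupby(["region", "month"], as_index=False)["amount"].sum()
    .rename(columns={"amount": "revenue"})
)

# Output 2: feature set for a churn model, built from the same source.
as_of = pd.Timestamp("2025-06-30")
churn_features = (
    sales.groupby("customer_id")
    .agg(
        rolling_spend=("amount", "sum"),            # simplified stand-in for a rolling window
        basket_size_variance=("basket_size", "var"),
        last_order=("order_date", "max"),
    )
    .reset_index()
)
churn_features["days_since_last_purchase"] = (as_of - churn_features["last_order"]).dt.days
churn_features = churn_features.drop(columns="last_order")

print(analytics_table)
print(churn_features)
```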

Of course, added to the AI-native architecture are:


  • Feature Store: branches from silver/gold; optimised for model inputs.

  • Prompt & Agent Layers: consume both features and gold-level summarised data.


What It Means for Technology Choices

An AI-native architecture, although still grounded in the familiar Bronze–Silver–Gold pattern and its extensions, does require a fresh look at how we employ technology. It’s no longer just about assembling pipelines that end in dashboards. It’s about building platforms that enable insight, prediction, and action as part of a continuous, intelligent loop.

The major cloud platforms are evolving to support this shift—some more natively than others.

AWS offers a wide range of AI and ML services through SageMaker, Redshift ML, and Bedrock. These are powerful, but typically require significant integration effort to achieve end-to-end alignment between transactional systems, analytics workflows, and AI workloads. Stitching these capabilities into a seamless architecture still demands considerable engineering effort.

Microsoft Fabric is making strides toward unified data and AI with its OneLake architecture, tightly integrated semantic models, and built-in Copilot experiences. The alignment with Microsoft 365 creates strong potential for business-user-facing intelligence. However, its generative AI and agent capabilities remain early-stage, particularly when it comes to real-time inference, feature governance, and operational AI observability.

Databricks, especially in light of the announcements made at the 2025 Data + AI Summit, is positioning itself as a platform purpose-built for AI-native design. It is evolving from a lakehouse platform into something broader: a full-stack data and AI operating system. One that allows transactional data, curated features, trained models, and intelligent agents to operate in the same environment, under consistent governance and lineage. Two announcements from the summit stand out. 


  • Lakebase, now in private preview, brings Postgres-compatible OLTP into the lakehouse itself. This allows transactional applications and systems to write directly to the same environment that powers models and dashboards. Data generated by operational systems becomes instantly available to models and agents, closing the loop between action and intelligence. Here, the lines between operational (and integration) and analytical systems are now blurred. You can run transactional workloads and, in the same environment, make that data immediately available to AI models and dashboards. That kind of convergence wasn’t just impractical five years ago—it was unthinkable. As a side-note, Snowflake Unistore is another interesting prospect where transactional and analytical workloads are merged, including row-store support for OLTP-style use cases.

  • Agent Bricks, now in beta, introduces a framework for building and deploying AI agents that are optimised on your enterprise data. These agents are not just experiments—they’re production-ready, evaluated, cost-controlled, and observable. They can interact with feature stores, prompt chains, or even transaction tables—responding to live context within your governed ecosystem.


These developments don’t replace the need for strong foundations. Instead, they extend what a modern data architecture can do. They signal that AI-native design has arrived. As soon as these technologies and others are in public preview, I will run some test drives for you 😊


This article was first published here: Building an AI-native data architecture | LinkedIn
