
The blockchain data stack in 2026

The blockchain data stack is the set of layers between a blockchain and a product that needs its data: nodes and RPC at the bottom, then indexers, data APIs, decentralized data networks, warehouses, and the analytics and AI agents on top. This guide is the map. It walks through each layer, what it does, and the trade-offs it carries, and links to a deeper guide for each one.

Updated 2026-06-04 · By the SQD team

What the blockchain data stack means

A blockchain is an append-only log built for consensus, not for queries. The data is all there (every transfer, swap, and contract call since genesis), but it is laid out for validation, not for the questions a product actually asks. Reading "every USDC transfer to this wallet last month" straight from a node means scanning blocks one by one. The blockchain data stack is the tooling that closes that gap: it turns raw chain data into something an application can query in milliseconds.

"Stack" is the right word because the pieces are ordered. Each layer consumes the one below it and exposes a cleaner interface to the one above: nodes expose raw blocks, indexers decode those blocks into tables, APIs serve the tables, warehouses store them for analysis, and products read the result. You rarely build all six layers. Most teams enter the stack at the layer that matches their need and buy or skip the rest.

The choice of entry layer comes down to a few axes: how fresh the data must be (real-time versus historical), the shape of the queries (point lookups versus analytics over history), the scale, whether the team self-hosts or buys, and whether it covers one chain or many. The RPC versus indexed data guide is the clearest single decision point; this page sets it in the context of the whole stack.

The stack at a glance

Six layers between the chain and the product. The chain sits at the top of the diagram; each layer consumes the one listed above it and passes cleaner data down to your application at the bottom. Each row corresponds to a section below.

The blockchain (raw data)
  1. Raw access: nodes and RPC endpoints, where the data originates. (RPC endpoints, archive nodes, node providers)
  2. Indexing: decoding raw chain data into queryable tables. (Squid SDK, Pipes SDK, subgraphs, Ponder, managed)
  3. Data APIs: the query interface over indexed data. (REST, GraphQL, SQL, real-time)
  4. Decentralized data networks: many independent operators serving data, not one vendor. (Validation, redundancy, network economics)
  5. Warehouses and pipelines: where decoded data lands for analytics, and how it is backfilled. (Postgres, ClickHouse, Parquet, backfill)
  6. Consumption: what reads the data: dashboards, apps, AI agents. (Analytics, AI agents, dashboards)
Your application (query-ready)

Layers 1 to 3 are the path every product takes from chain to query. Layer 4 is how layers 2 and 3 are operated. Layers 5 and 6 are where analytics and AI workloads live. SQD spans indexing through warehouse feeds.

Layer 1: Raw access (nodes and RPC)

At the base of the stack are the nodes that run the network. A node validates blocks and keeps a copy of chain state; an archive node additionally keeps the full historical state at every past block, which is what lets you query the chain as it looked years ago. Archive nodes are storage-heavy and operationally demanding to run, which is why most teams reach the chain through a managed JSON-RPC provider, such as Alchemy or QuickNode, that runs the nodes for them. The node software underneath is an Ethereum execution client such as Geth or Reth.

RPC is the right tool for a specific job: reading current or recent state by key, and submitting transactions. Fetching one account's balance, reading a contract's current value, sending a transaction: RPC does all of this well. Where it struggles is analytical access. Providers typically cap the block range and result count that methods like eth_getLogs accept per call, so reconstructing a long history means many sequential round-trips and wall-clock time that grows linearly with the range. The RPC versus indexed data guide covers exactly when a raw endpoint is enough and when the next layer up pays off.
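To make the round-trip cost concrete, here is a minimal sketch of how a client might split a long history into eth_getLogs calls. The cap of 2,000 blocks per call and the helper names (chunkBlockRange, buildGetLogsRequests) are assumptions for illustration; real limits vary by provider and plan.

```typescript
// Hypothetical cap for illustration; check your provider's documented limit.
const MAX_BLOCKS_PER_CALL = 2_000;

interface GetLogsRequest {
  jsonrpc: "2.0";
  id: number;
  method: "eth_getLogs";
  params: [{ fromBlock: string; toBlock: string; address: string; topics: string[] }];
}

// Split an inclusive block range into windows no wider than maxSpan blocks.
function chunkBlockRange(from: number, to: number, maxSpan: number): Array<[number, number]> {
  if (maxSpan < 1) throw new Error("maxSpan must be at least 1");
  const windows: Array<[number, number]> = [];
  for (let start = from; start <= to; start += maxSpan) {
    windows.push([start, Math.min(start + maxSpan - 1, to)]);
  }
  return windows;
}

// Build one JSON-RPC request body per window. Block numbers are hex-encoded.
function buildGetLogsRequests(
  address: string,
  topic0: string,
  from: number,
  to: number,
  maxSpan: number = MAX_BLOCKS_PER_CALL
): GetLogsRequest[] {
  return chunkBlockRange(from, to, maxSpan).map(
    ([start, end], i): GetLogsRequest => ({
      jsonrpc: "2.0",
      id: i + 1,
      method: "eth_getLogs",
      params: [
        {
          fromBlock: "0x" + start.toString(16),
          toBlock: "0x" + end.toString(16),
          address,
          topics: [topic0],
        },
      ],
    })
  );
}
```

The number of requests grows with the width of the range divided by the cap, which is the linear growth described above. Sending the requests, retrying, and handling result-count errors are left out of the sketch.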

For head-to-heads on the RPC providers specifically, see SQD versus Alchemy and SQD versus QuickNode, which contrast a node provider's raw RPC with a read-optimized historical data layer.

Layer 2: Indexing

Indexing is the layer that turns raw chain data into decoded tables a product can query. An indexer reads blocks, transactions, and event logs, decodes them against contract ABIs, transforms the result into a schema, and writes it to a database. The what is a blockchain indexer guide walks the full pipeline; this section places it in the stack.

There are two ways to get an indexer: author your own with a framework, or buy a managed one. On the framework side, The Graph's subgraphs (mappings in AssemblyScript, documented at thegraph.com/docs) are the most widely deployed. SQD's Squid SDK and Pipes SDK (TypeScript) and Ponder (TypeScript) take a batch-processor approach, and Envio is another TypeScript option. On the managed side, Goldsky hosts subgraph-compatible and streaming pipelines so a team avoids running the infrastructure. The build versus buy decision turns on how much schema control you need against how much of the operations you want to own.

The decoding work differs by virtual machine. EVM chains emit typed event logs keyed by a signature hash and ABI (the EVM indexer guide), while Solana programs execute instructions identified by program ID and discriminator (the Solana indexer guide). A product that spans several chains, or crosses virtual machines, benefits from a multi-chain indexer that lands every chain in one schema. For The Graph specifically, including what migrating a subgraph involves, see SQD versus The Graph and SQD versus Goldsky.
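As an illustration of the EVM side of that decoding work, the sketch below turns one raw ERC-20 Transfer log into a table row by hand. The RawLog and TransferRow shapes and the function names are assumptions made for this example; an indexing framework would normally generate this code from the contract ABI.

```typescript
// keccak256("Transfer(address,address,uint256)"): topic0 of every ERC-20 Transfer log.
const TRANSFER_TOPIC0 =
  "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef";

// Hypothetical shapes for this sketch, not a specific SDK's types.
interface RawLog {
  address: string;
  topics: string[];
  data: string;
  blockNumber: number;
  logIndex: number;
}

interface TransferRow {
  token: string;
  from: string;
  to: string;
  valueHex: string; // uint256 amount, kept as hex to avoid precision loss
  blockNumber: number;
  logIndex: number;
}

// Indexed address parameters are left-padded to 32 bytes; keep the last 20 bytes.
function topicToAddress(topic: string): string {
  return "0x" + topic.slice(-40).toLowerCase();
}

// Returns a decoded row, or null when the log is not an ERC-20 Transfer.
function decodeTransfer(log: RawLog): TransferRow | null {
  const [topic0, fromTopic, toTopic] = log.topics;
  // ERC-20 Transfer has exactly three topics; ERC-721 reuses the signature with four.
  if (log.topics.length !== 3) return null;
  if (topic0 === undefined || fromTopic === undefined || toTopic === undefined) return null;
  if (topic0.toLowerCase() !== TRANSFER_TOPIC0) return null;
  return {
    token: log.address.toLowerCase(),
    from: topicToAddress(fromTopic),
    to: topicToAddress(toTopic),
    valueHex: "0x" + (log.data.slice(2).replace(/^0+/, "") || "0"),
    blockNumber: log.blockNumber,
    logIndex: log.logIndex,
  };
}
```

The non-indexed amount lives in the data field, while the two indexed addresses live in the topics, which is why the decoder reads from both.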

Layer 3: Data APIs

Indexed tables need an interface, and that interface is the data API. Three shapes dominate. GraphQL became widespread for onchain data largely because The Graph standardized on it, and it fits typed, relational reads from a known schema. SQL fits analytics: ad hoc aggregation and joins across large history, the model that dashboards like Dune expose. REST fits simple, cacheable reads. The blockchain data API guide covers the named providers and what each interface is good at, so this page does not duplicate that rundown.

The other axis at this layer is real-time against historical. A live feed (WebSockets, server-sent events, or polling) keeps a product current with the chain head; a historical query serves data from the past without the caller running an archive node. Production systems usually need both: a low-latency stream for the live view, and bulk historical access for backfills and analytics. SQD's Portal serves filtered ranges of raw onchain data (logs, transactions, traces, and state diffs) over HTTP across the networks listed at sqd.dev/chains, with full history available; the Squid and Pipes SDKs decode it into typed tables. The SQD versus Dune comparison contrasts a SQL analytics surface with a programmable data lake.
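One practical consequence of needing both is the handoff between them. A minimal sketch, assuming the live feed is started before the backfill finishes so the two overlap: events from both sources are merged and de-duplicated by block number and log index. The ChainEvent shape and the function name are hypothetical.

```typescript
// Hypothetical event shape for this sketch.
interface ChainEvent {
  blockNumber: number;
  logIndex: number;
  payload: string;
}

// Merge a completed historical backfill with events buffered from a live feed.
// When both sources contain the same (blockNumber, logIndex), the historical
// copy is kept. The result is returned in chain order.
function mergeBackfillWithLive(historical: ChainEvent[], live: ChainEvent[]): ChainEvent[] {
  const seen = new Set<string>();
  const merged: ChainEvent[] = [];
  for (const event of historical.concat(live)) {
    const key = event.blockNumber + ":" + event.logIndex;
    if (seen.has(key)) continue;
    seen.add(key);
    merged.push(event);
  }
  return merged.sort(
    (a, b) => a.blockNumber - b.blockNumber || a.logIndex - b.logIndex
  );
}
```

Overlapping the two sources and de-duplicating is simpler than trying to stop the backfill and start the stream at exactly the same block. The sketch ignores reorgs near the chain head.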

Layer 4: Decentralized data networks

Layer 4 is not a separate step in the path from chain to query; it is a property of how layers 2 and 3 are operated. A centralized data provider runs the indexers and serves the API from infrastructure it owns. A decentralized data network spreads that serving across many independent operators, coordinated by a protocol that handles which operator serves which data, how results are validated, and how operators are paid.

The trade-off is redundancy and reduced single-provider dependence against added coordination complexity. The two prominent examples are The Graph's network (documented at thegraph.com/docs) and SQD's own decentralized network, which serves the Portal from a distributed set of worker nodes (see docs.sqd.dev). Whether decentralization matters for a given product depends on its tolerance for depending on one vendor for its data.
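To give a feel for what validating results across operators can mean, here is a deliberately simple sketch: ask several operators the same query and accept an answer only when enough of them return the same body. This is an illustration of the general idea, not a description of how The Graph's network or SQD's network validates responses; the OperatorResponse shape and the quorum rule are assumptions.

```typescript
// Hypothetical response shape: which operator answered, and the serialized result.
interface OperatorResponse {
  operatorId: string;
  body: string;
}

// Return the first response body that at least `quorum` operators agree on,
// or null when no body reaches the quorum.
function agreeOnResponse(responses: OperatorResponse[], quorum: number): string | null {
  const votes = new Map<string, number>();
  for (const response of responses) {
    const count = (votes.get(response.body) ?? 0) + 1;
    votes.set(response.body, count);
    if (count >= quorum) return response.body;
  }
  return null;
}
```

The sketch also shows the coordination cost mentioned above: every extra operator queried adds redundancy and adds work.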

Layer 5: Warehouses and pipelines

Once data is decoded, analytics and machine-learning workloads want it in a warehouse rather than behind a request-response API. Postgres is the common default for application and transactional workloads. ClickHouse and other columnar stores suit high-volume analytical scans. Teams that work file-first store Parquet datasets and query them with engines like DuckDB. The choice follows the query pattern: point lookups favor Postgres, wide scans over long history favor a columnar engine.

The pipeline is what fills the warehouse. A streaming pipeline keeps the store current with new blocks; a backfill loads the historical range up front. Backfilling years of decoded history is the expensive part if it means crawling an archive node, which is why this layer often pulls from a managed data lake instead. SQD's Pipes SDK streams decoded data into Postgres or ClickHouse with reorg handling built in, and the Portal serves the historical ranges a backfill needs without the team operating a node.
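Reorg handling is the part of a streaming pipeline that is easiest to get wrong, so a small sketch may help. It tracks the unfinalized tail of the chain by hash and reports which block numbers must be rolled back in the store before a new block is written. This is a generic illustration, not the Pipes SDK's implementation; the ReorgTracker class and BlockHeader shape are hypothetical, and it assumes blocks arrive in order starting from a finalized block.

```typescript
// Hypothetical header shape: just enough to detect a fork.
interface BlockHeader {
  number: number;
  hash: string;
  parentHash: string;
}

// Tracks recent blocks so rows written from orphaned blocks can be rolled back.
class ReorgTracker {
  private tail: BlockHeader[] = [];

  // Returns the block numbers whose rows must be deleted before `block` is written.
  apply(block: BlockHeader): number[] {
    const rolledBack: number[] = [];
    // Drop stored blocks until the tip is the new block's parent (or nothing is left).
    while (true) {
      const tip = this.tail[this.tail.length - 1];
      if (tip === undefined || tip.hash === block.parentHash) break;
      rolledBack.push(tip.number);
      this.tail.pop();
    }
    this.tail.push(block);
    return rolledBack;
  }
}
```

In a real pipeline the rollback and the write of the new block would happen in one database transaction, and blocks past the finality depth would be pruned from the tail.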

Layer 6: Consumption (analytics and AI agents)

The top of the stack is whatever reads the data. Analytics is the largest slice: dashboards and BI built on the warehouse, whether through a hosted tool like Dune (the SQD versus Dune comparison covers the analytics surface) or an in-house stack over Postgres or ClickHouse. Around analytics sit the use-case verticals, each reading the same lower layers for a different purpose: analytics, DeFi and trading, wallets and payments, compliance, stablecoins, and real-world assets.

The fastest-growing consumer is the AI agent. An agent acting on blockchain state needs structured, queryable data, not raw RPC responses, and it needs to reach that data through a tool interface. The Model Context Protocol (MCP) has become the common pattern for exposing a data source to an agent as a callable tool, and SQD provides an MCP server over the Portal (documented at docs.sqd.dev). Agent workflows mix pre-indexed history for fast lookups with on-demand reads for the latest state.
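A sketch of what "a data source as a callable tool" looks like in practice: a tool is described by a name, a description, and a JSON Schema for its arguments, and a handler answers calls with content items. The shapes below are modeled on the tool format in the MCP specification, but the get_transfers tool, its arguments, and the in-memory data are hypothetical and are not SQD's MCP server; consult the specification and docs.sqd.dev for the authoritative interfaces.

```typescript
// Hypothetical pre-indexed row.
interface Transfer {
  from: string;
  to: string;
  valueHex: string;
  blockNumber: number;
}

// Tool descriptor: the agent reads this to learn what it can call and with what arguments.
const getTransfersTool = {
  name: "get_transfers",
  description: "List indexed token transfers received by an address since a block.",
  inputSchema: {
    type: "object",
    properties: {
      address: { type: "string", description: "Recipient address, 0x-prefixed" },
      fromBlock: { type: "number", description: "Lowest block number to include" },
    },
    required: ["address", "fromBlock"],
  },
};

// Handler: validates the arguments, then answers from pre-indexed rows.
function handleGetTransfers(
  args: { address?: unknown; fromBlock?: unknown },
  indexed: Transfer[]
) {
  if (typeof args.address !== "string" || typeof args.fromBlock !== "number") {
    return {
      isError: true,
      content: [{ type: "text", text: "address (string) and fromBlock (number) are required" }],
    };
  }
  const address = args.address.toLowerCase();
  const fromBlock = args.fromBlock;
  const rows = indexed.filter(
    (t) => t.to.toLowerCase() === address && t.blockNumber >= fromBlock
  );
  return { isError: false, content: [{ type: "text", text: JSON.stringify(rows) }] };
}
```

The agent never sees raw RPC responses here: it sends structured arguments and gets structured rows back, which is the point of putting a tool interface over indexed data.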

Putting the stack together

Few products build all six layers. The practical question is which layer to start from, and the answer follows the workload:

Where to start
  1. A simple dapp reading current state: Layer 1
  2. Querying history or aggregating: Layers 2 and 3
  3. Analytics or machine learning: Layer 5
  4. A product spanning many chains: Layer 2
  5. An AI agent acting on chain state: Layer 3 or Layer 5

A simple dapp reading current state can live at Layer 1 on an RPC provider and skip the rest. A product that queries history or aggregates needs Layer 2 indexing and a Layer 3 API, self-hosted with a framework or bought managed. An analytics or ML team adds Layer 5, piping decoded data into a warehouse for SQL. A multi-chain product wants a multi-chain indexer so adding the next chain is a configuration change rather than a new stack. An AI agent reads Layer 3 or Layer 5 through a tool interface such as MCP.

The recurring decisions are the same at every entry point: real-time against historical, point lookups against analytics, self-host against buy, and single-chain against multi-chain. The RPC versus indexed data and multi-chain indexing guides go deep on two of them, and the comparison pages put specific tools side by side.

SQD sits across the middle of the stack: the Portal is a data lake that serves filtered, raw onchain data over the networks at sqd.dev/chains, and the Squid and Pipes SDKs are the indexing frameworks that decode it on top.

Frequently asked questions

What is the blockchain data stack?
The blockchain data stack is the set of layers that sit between a blockchain and a product that needs its data. From the bottom up: nodes and RPC endpoints expose the raw chain, indexers decode it into tables, data APIs serve those tables, decentralized data networks distribute the serving across many operators, warehouses store it for analytics, and the top layer is whatever consumes it, from dashboards to AI agents. Most teams use a subset rather than all six layers.
What are the layers of a blockchain data stack?
Six layers: raw access (nodes and RPC), indexing (decoding raw data into tables), data APIs (REST, GraphQL, or SQL query interfaces), decentralized data networks (the serving layer distributed across independent operators), warehouses and pipelines (Postgres, ClickHouse, or Parquet for analytics, plus the backfill that loads them), and consumption (analytics, BI, and AI agents). Each layer consumes the one beneath it and exposes a cleaner interface upward.
Do I need an indexer, or is an RPC node enough?
An RPC node is enough when you read current state or recent activity by key, for example a single account balance or the receipt for one transaction, and when you submit transactions. You need an indexer when you query across history or aggregate, for example every transfer to an address over a year, or swap volume per pool per day. RPC has no efficient way to answer those; an indexer pre-decodes and stores the data so the query is a database lookup.
What is the difference between a blockchain indexer and a data API?
An indexer is the process that reads raw chain data and writes decoded, queryable tables. A data API is the interface those tables are served through, such as GraphQL, SQL, or REST. They are adjacent layers: the indexer produces the data, the API exposes it. Some products bundle both; some serve a data API on top of an indexer someone else runs.
Where is indexed blockchain data stored?
Commonly in Postgres for transactional and application workloads, and in a columnar store such as ClickHouse for high-volume analytics. Teams that work file-first use Parquet datasets queried with engines like DuckDB. The choice follows the query pattern: point lookups and moderate volume favor Postgres, while wide scans over large history favor a columnar engine.
How do AI agents get onchain data?
An AI agent needs structured, queryable onchain data rather than raw RPC responses. In practice that means reading from a data API or a warehouse through a tool interface, increasingly the Model Context Protocol (MCP), which lets an agent call a data source as a tool. The data can be pre-indexed for fast historical queries or fetched on demand for the latest state, and most agent workflows use both.

Build your stack on SQD

The Portal serves raw onchain data across the networks at sqd.dev/chains. The Squid and Pipes SDKs are the indexing layer that decodes it on top.