Data access · 8 min read

How to validate a blockchain dataset against a second source

You cannot tell whether a blockchain dataset is complete by looking at it. A table can be perfectly formatted, internally consistent, and pass every structural check, and still be missing rows that nothing in the data points to. Finding them is a procedure: reconcile the data against an independent source, count against count, and narrow the disagreement until it resolves to a single record. This guide is that procedure, run live against Polygon logs with SQD's Portal and Dune, and it ends on a real gap. Every figure is from a query you can copy and run.

Updated 2026-06-30 · By the SQD team

1. What you can check from inside the data

Validation of a dataset comes in two kinds, and only one of them can be done from inside the dataset itself. The first kind is structural and internal consistency: are addresses and hashes well-formed, are the keys unique, are block numbers contiguous, do row counts between related tables agree, do value sums add up. These checks run against the data on hand and they are essential. They are also blind to one thing: a row that was never written.

The second kind is completeness, and it cannot be established from one copy of the data. An omission leaves no internal trace. The only way to detect it is to compare against something that reached the chain by a different path and ask where the two disagree. This guide walks the full procedure on a real dataset: structural checks first, then a count reconciliation against an independent source, then an escalation from a coarse disagreement down to the exact missing record.

2. Structural checks: necessary, not sufficient

Start with the cheap checks, because they catch the common failures: a truncated import, a duplicated batch, a gap in the block range. Run them on a window of Polygon logs in Dune, the source we use for the comparison in this guide:

Dune
-- block continuity: every block present, no gaps
SELECT count(distinct number) AS distinct_blocks,
max(number) - min(number) + 1 AS span
FROM polygon.blocks
WHERE number BETWEEN 74600000 AND 74630000;
-- distinct_blocks = 30001, span = 30001 (contiguous)
-- log-key uniqueness: no duplicated (block, index) pairs
SELECT count(*) AS total_logs,
count(*) - count(DISTINCT (block_number, index)) AS duplicate_keys
FROM polygon.logs
WHERE block_number BETWEEN 74600000 AND 74630000;
-- total_logs = 15825038, duplicate_keys = 0

Both pass. Every block in the window is present, and all 15,825,038 logs have a unique key. Here is the catch: this is the exact window in which the dataset is, in fact, missing records. The checks pass over data that is incomplete, because contiguity and uniqueness describe the rows that are there and say nothing about the rows that are not. Structural validation is necessary and it is not sufficient. To go further you need a second opinion.

3. Reconcile counts against a second source

Reconciliation compares aggregate counts between two datasets, bucket by bucket, and flags the buckets that disagree. The single rule that makes it work is escalating granularity: start coarse, then split only the buckets that disagree into finer ones, so the search narrows by an order of magnitude at each step instead of scanning everything. Reconciling a full table count would bury a handful of missing rows in tens of millions; reconciling a targeted slice by day surfaces them.

Take one slice, the logs emitted by the Polygon state-receiver contract at 0x0000…1001, and count it per day on each side. On Dune that is one grouped query:

Dune
SELECT CAST(date_trunc('day', block_time) AS date) AS day,
count(*) AS commit_logs
FROM polygon.logs
WHERE contract_address = 0x0000000000000000000000000000000000001001
AND topic0 = 0x5a22725590b0a51c923940223f7458512164b1113359a735e86e7f27f44791ee
AND block_time >= timestamp '2025-07-29'
AND block_time < timestamp '2025-08-01'
GROUP BY 1 ORDER BY 1;

The same count from Portal comes off the stream API, filtered to that contract and event over the day's block range. The stream returns matches in chunks, each response ending at a block number you pass back as the next fromBlock until you reach toBlock:

your terminal
# 2025-07-31 UTC spans Polygon blocks 74,609,447 to 74,647,462
curl https://portal.sqd.dev/datasets/polygon-mainnet/stream \
-H 'content-type: application/json' \
-d '{
"type": "evm",
"fromBlock": 74609447,
"toBlock": 74647462,
"logs": [{
"address": ["0x0000000000000000000000000000000000001001"],
"topic0": ["0x5a22725590b0a51c923940223f7458512164b1113359a735e86e7f27f44791ee"]
}],
"fields": { "log": { "transactionHash": true } }
}'

Side by side, the daily counts settle the question that neither source answers alone. Dune's own numbers swing from 125 to 345 a day on natural volume, so a low day looks like nothing in particular. Against Portal, two days separate from the rest:

Day (UTC)PortalDuneMissing
  • 2025-07-292052050
  • 2025-07-3018012555
  • 2025-07-3120618125
State-sync commit logs per day: SQD Portal vs Dune polygon.logs (queried 2026-06-30)

2025-07-29 agrees to the row, which is the result you want most days and the evidence the two counts are measuring the same thing. The next two days do not: Dune is short 55 commit logs on 2025-07-30 and 25 on 2025-07-31, 80 across the two days, spread over 42 state-sync transactions. The reconciliation has told you where to look. It has not yet told you what is missing.

4. Escalate to the record

Keep splitting the disagreeing bucket. Within 2025-07-31, the same count by block lands on individual blocks, and one of them is 74,614,768. Counting all logs in that block, not just the slice, Portal returns 856 and Dune returns 848. The eight-row difference is a single transaction: a synthetic state-sync receipt that Dune's polygon.transactions omits entirely, holding 114 records where Portal holds 115.

MeasurePortalDune
  • transaction records115114
  • log records856848
  • logs from the state-receiver 0x…100120
Block 74,614,768 on Polygon: SQD Portal vs Dune

That is the bottom of the escalation: a coarse daily disagreement, narrowed in two steps to two specific log records that one source has and the other does not. The records turn out to be Polygon state-sync logs, a class the chain's block header never commits to, which is why a structurally valid dataset can drop them without any internal check noticing. The mechanism, and why no Merkle proof would have caught this either, is the subject of the companion guide on the onchain data block proofs leave out.

5. A source that does not share the blind spot

Reconciliation is only as good as the independence of the two sources. Two datasets built from the same upstream, or on the same assumption about what counts as a transaction, share the same gaps and agree with each other while both being wrong. The state-sync case is exactly that trap: any pipeline that reconstructs logs from the canonical transaction list drops the same records, so a second source built the same way would have confirmed the wrong answer.

So a disagreement is a lead, not a verdict, and the record gets confirmed on a third source that reaches the chain differently: an archive node's eth_getLogs, which returns the state-sync logs directly, or the public explorer, which renders the synthetic transaction in full. When three sources with different ingestion paths agree on a record and one dataset lacks it, the gap is real. This is why a validation setup worth the name cross-checks against several sources, RPC nodes and independent warehouses among them, rather than a single mirror.

6. Reconciliation with SQD

The procedure above is the one SQD runs against its own data on the way in: structural checks, then count reconciliation against independent sources at escalating granularity, then record-level spot checks, repeated as new blocks arrive rather than once at import. It is the part of data quality that proofs and structural checks cannot reach, and it is why Portal returns records like the state-sync logs above that a transaction-list pipeline silently omits.

The same queries are the reader's to run. Portal serves logs, transactions, traces, and state diffs as a stream you can reconcile any third-party dataset against, and the Squid and Pipes SDKs pull the same data into your own store for continuous checks. For where complete, reconciled data is load-bearing, see the analytics and compliance solutions.

Frequently asked questions

How do you validate blockchain data for completeness?
In layers. Structural checks confirm the rows you have are well-formed, unique, and contiguous; integrity checks confirm row-count invariants between related tables. Neither establishes completeness, because a missing row leaves no trace inside the data. For that you reconcile counts against an independent source, escalating granularity from coarse buckets down to the record, then confirm the offending rows on a third source. Completeness is the property a single dataset cannot prove about itself.
Why are structural checks not enough?
They verify shape, not coverage. Block numbers can be perfectly contiguous and every (block, log_index) key unique while entire records are absent, because the absence is not visible from the rows that remain. In the worked example here, a dataset passes block-continuity and uniqueness checks over the exact 30,000-block window in which it is missing 47 state-sync log records. Structural checks catch corruption and duplication; they are blind to omission.
What makes a good second source for reconciliation?
Independence. Two datasets built from the same upstream pipeline share the same gaps, so they agree while both being wrong. A useful second source reaches the chain by a different path: an archive node's RPC, a different warehouse, or the block header's own commitments where they apply. Use count aggregations for breadth and record-level spot checks for depth, and confirm any disagreement on a third source before trusting it.
How do you find which records are missing, not just that counts differ?
Escalate granularity. Start with coarse time buckets and keep splitting only the buckets that disagree: month to day, day to hour, hour to block, until the difference is a handful of records. Then diff the record keys directly. Each step narrows the search by an order of magnitude, so a gap of a few rows in tens of millions is reachable in a few queries rather than a full-table scan.
Can you validate a dataset without a second source at all?
Only partially. Structural and internal-consistency checks catch a great deal, and cryptographic commitments verify the rows that are committed to the block header. But the completeness of anything the header does not commit to cannot be established from a single source. The companion guide on what block proofs leave out covers why, with the class of logs that is missing here.

Need a dataset you can reconcile against?

Stream logs, transactions, traces, and state diffs from Portal and check any source against them.