The Importance of Match Probability in Record Linkage

This post highlights our record linkage accomplishments in 2025 and outlines our plans for 2026, with a focus on linkage data requests.

A clear, consistent linkage ensures that each cancer case is counted only once — not missed or duplicated. Although our internal work is based on the fiscal year, this article uses calendar years for clarity.


Historical Context
Aristotle described the “law of the excluded middle” in the fourth century BCE: a statement is either true or false, with no third option. Record linkage challenges this idea. When two records appear similar but not identical, an uncertain “middle” is unavoidable. The most widely used statistical framework for resolving this uncertainty is the Fellegi–Sunter (1969) model. Historically, two implementation approaches have been common: score-based algorithms, such as EpiLink, and Expectation-Maximization (EM) algorithms, which assign a match probability to each record pair.
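To make the idea of a match probability concrete, here is a minimal Python sketch of the Fellegi–Sunter calculation for a single record pair. It is an illustration under stated assumptions, not the implementation used by fastLink, Splink, or Match*Pro. The function name, the prior, and the m and u values are all hypothetical, and the fields are assumed to be conditionally independent. In practice an EM algorithm would estimate m and u from the data rather than take them as given.

```python
import math


def match_probability(prior, comparisons):
    """Posterior match probability for one record pair.

    prior: assumed probability that a random candidate pair is a true match.
    comparisons: list of (m, u, agrees) tuples, one per compared field, where
        m = P(field agrees | the records are a match)
        u = P(field agrees | the records are not a match)
    Fields are treated as conditionally independent (an assumption).
    """
    # Start from the prior odds, expressed as a log2 "match weight".
    log_odds = math.log2(prior / (1 - prior))
    for m, u, agrees in comparisons:
        if agrees:
            log_odds += math.log2(m / u)              # agreement raises the weight
        else:
            log_odds += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    odds = 2 ** log_odds
    return odds / (1 + odds)


# Hypothetical m and u values for three fields; not taken from any FCDS model.
all_agree = [(0.95, 0.01, True), (0.90, 0.05, True), (0.98, 0.001, True)]
one_disagrees = [(0.95, 0.01, True), (0.90, 0.05, False), (0.98, 0.001, True)]

p_all = match_probability(0.001, all_agree)
p_mixed = match_probability(0.001, one_disagrees)
print(f"all fields agree:    {p_all:.4f}")
print(f"one field disagrees: {p_mixed:.4f}")
```

With these made-up inputs, the pair that agrees on every field lands above 0.999, while a single disagreement pulls the probability down to roughly 0.91. That second pair is exactly the kind of uncertain “middle” a threshold has to resolve.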

Over the past five years, we’ve shown that match probabilities lead to clearer, more consistent, and more defensible linkage decisions than score-based methods. This approach offers a principled way to reduce uncertainty and enhance accuracy.

What We Accomplished in 2025

In the FCDS 2024 monograph, “Comparing the Linkage Performance of fastLink, Splink, and Match*Pro at the Florida Cancer Registry Using Simulated Pseudopeople Data,” we reported the importance of match probability. Compared with fastLink, Splink predicted about 9% more correct matches, while Match*Pro predicted about 20% fewer. Based on these results, the FCDS recommended transitioning from fastLink to Splink.
In 2025, we focused on building a Splink template and cross-training on fastLink to strengthen our record linkage capabilities.

Looking Forward to 2026
In late 2026, we will begin a new round of performance comparisons of fastLink, Splink, and Match*Pro. Each tool is expected to receive a major update. The fastLink R package is anticipated to reach version 1.0, incorporating semi-automated clerical review (“Active Learning”) and probabilistic blocking through Locality Sensitive Hashing (LSH). These additions are expected to improve both accuracy and speed, while making the software more user-friendly.

Splink, the Python-based library, is expected to reach version 5.0. This update is projected to bring improvements in accuracy, speed, and usability, with the new Clerical Review tool likely to be the most significant enhancement.

Match*Pro, a proprietary stand‑alone application, has moved away from its earlier EpiLink-style scoring and now uses an EM algorithm in version 3.x — the same general approach used by fastLink and Splink. This shift brings all three tools into closer methodological alignment.

Why Thresholds Matter

The key metric for comparing all three tools will be the threshold match probability. Threshold choice directly affects the final linked dataset. For example, increasing the threshold from 0.95 to 0.99 discards pairs with a match probability of 0.96. Higher thresholds result in more missed matches (false negatives). Lower thresholds result in more wrong matches (false positives). FCDS policy requires releasing only expected correct matches. Expected wrong matches must not be released.
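The trade-off can be sketched in a few lines of Python. This is a hedged illustration, not FCDS production code. The pair IDs and probabilities are invented, the helper names are hypothetical, and the sketch assumes that match probabilities are well calibrated, so that the expected number of wrong matches among released pairs is the sum of (1 - probability) and the expected number of missed matches among withheld pairs is the sum of their probabilities. It also assumes a pair exactly at the threshold is released.

```python
def apply_threshold(pairs, threshold):
    """Split scored pairs into released and withheld sets (>= is an assumption)."""
    released = [p for p in pairs if p["match_probability"] >= threshold]
    withheld = [p for p in pairs if p["match_probability"] < threshold]
    return released, withheld


def expected_wrong(released):
    """Expected false positives among released pairs, assuming calibration."""
    return sum(1 - p["match_probability"] for p in released)


def expected_missed(withheld):
    """Expected false negatives among withheld pairs, assuming calibration."""
    return sum(p["match_probability"] for p in withheld)


# Invented example pairs; pair 3 is the 0.96 case mentioned in the text.
pairs = [
    {"pair_id": 1, "match_probability": 0.9999},
    {"pair_id": 2, "match_probability": 0.995},
    {"pair_id": 3, "match_probability": 0.96},
    {"pair_id": 4, "match_probability": 0.90},
    {"pair_id": 5, "match_probability": 0.40},
]

for threshold in (0.95, 0.99):
    released, withheld = apply_threshold(pairs, threshold)
    print(
        f"threshold={threshold}: released={len(released)}, "
        f"expected wrong={expected_wrong(released):.4f}, "
        f"expected missed={expected_missed(withheld):.2f}"
    )
```

In this toy data, raising the threshold from 0.95 to 0.99 drops the 0.96 pair from the release. The expected number of wrong matches falls from about 0.045 to about 0.005, while the expected number of missed matches rises from 1.30 to 2.26.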

Based on previous testing, we anticipate that Splink 5.x with a threshold of 0.99 will become the default for linkage data requests. fastLink 1.x with a threshold of 0.999 and Match*Pro 3.x with a threshold of 0.9999 are likely to serve as optional alternatives. These predictions may change as new versions are released and evaluated.

We Want Your Input

Predictions are always uncertain — and this is where your feedback matters. What would you like the upcoming FCDS comparison to cover? Our goal is to make linkage decisions more transparent, consistent, and reproducible. Match probability is the foundation of that progress.

