The Fellegi-Sunter model¶
This topic guide gives a high-level introduction to the Fellegi-Sunter model, the statistical model that underlies Splink's methodology.
For a more detailed interactive guide that aligns to Splink's methodology see Robin Linacre's interactive introduction to probabilistic linkage.
Parameters of the Fellegi-Sunter model¶
The Fellegi-Sunter model has three main parameters that need to be considered to generate a match probability between two records:
- $\lambda$ - probability that any two records match
- $m$ - probability of a given observation given the records are a match
- $u$ - probability of a given observation given the records are not a match
λ probability¶
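The lambda ($\lambda$) parameter is the prior probability that any two records match, i.e. how likely a match is before any features of the records are taken into account:
$$\lambda = \Pr(\textsf{Records match})$$
This is the same for all record comparisons, but is highly dependent on: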
- The total number of records
- The number of duplicate records (more duplicates increases $\lambda$)
- The overlap between datasets
  - Two datasets covering the same cohort (high overlap, high $\lambda$)
  - Two entirely independent datasets (low overlap, low $\lambda$)
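The dependence of $\lambda$ on dataset size and overlap can be illustrated with a small Python sketch. It assumes a link between two datasets in which every record on one side is compared with every record on the other, and treats $\lambda$ as the number of truly matching pairs divided by the number of candidate pairs. The function name and the record counts are hypothetical and are not part of Splink's API.

```python
def prior_match_probability(expected_matches: float, n_left: int, n_right: int) -> float:
    """Lambda for linking two datasets: matching pairs / all candidate pairs."""
    total_comparisons = n_left * n_right
    if total_comparisons == 0:
        raise ValueError("both datasets must contain at least one record")
    return expected_matches / total_comparisons


# Hypothetical: two 1,000-record datasets covering the same cohort,
# so every person appears once in each dataset.
high_overlap = prior_match_probability(1_000, 1_000, 1_000)

# Hypothetical: two nearly independent datasets with only 10 people in common.
low_overlap = prior_match_probability(10, 1_000, 1_000)

print(f"high overlap: {high_overlap:.6f}")
print(f"low overlap:  {low_overlap:.6f}")
```

Even in the high-overlap case $\lambda$ is small, because the number of candidate pairs grows with the product of the dataset sizes.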
m probability¶
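The $m$ probability is the probability of a given observation given the records are a match:
$$m = \Pr(\textsf{Observation} \mid \textsf{Records match})$$
For example, consider the $m$ probabilities for date of birth (DOB). For two records that are a match, what is the probability that:
- DOB is the same:
  - Almost 100%, say 98% ($m \approx 0.98$)
- DOB is different:
  - Maybe a 2% chance of a data error? ($m \approx 0.02$)
The $m$ probability is largely a measure of data quality: the less reliably DOB is recorded, the lower the $m$ probability of an exact DOB match.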
u probability¶
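The $u$ probability is the probability of a given observation given the records are not a match:
$$u = \Pr(\textsf{Observation} \mid \textsf{Records do not match})$$
For example, consider the $u$ probabilities for surname. For two records that are not a match, what is the probability that:
- Surname is the same:
  - Depending on the surname, <1%?
- Surname is different:
  - Almost 100%
The $u$ probability is a measure of coincidence: with so many possible surnames, two records that are not a match rarely share one.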
Interpreting m and u¶
In the case of a perfect unique identifier:
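- A person is only assigned one such value - $m = 1$ (match) or $m = 0$ (non-match)
- A value is only ever assigned to one person - $u = 0$ (match) or $u = 1$ (non-match)
Where $m$ and $u$ deviate from these ideals can usually be explained intuitively: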
m probability¶
A measure of data quality/reliability.
How often might a person's information change legitimately or through data error?
- Names: typos, aliases, nicknames, middle names, married names etc.
- DOB: typos, estimates (e.g. 1st Jan YYYY where date not known)
- Address: formatting issues, moving house, multiple addresses, temporary addresses
u probability¶
A measure of coincidence/cardinality [1].
How many different people might share a given identifier?
- DOB (high cardinality) – for a flat age distribution spanning ~30 years, there are ~10,000 DOBs (0.01% chance of a match)
- Sex (low cardinality) – only 2 potential values (~50% chance of a match)
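The link between cardinality and $u$ in the two examples above can be sketched in Python. This assumes every value of the identifier is equally likely, which real data rarely satisfies (common surnames, for instance, are far from uniform), so it is only a rough guide. The function name is hypothetical.

```python
def u_exact_match_uniform(n_distinct_values: int) -> float:
    """Chance that two unrelated records share a value, assuming all
    values of the identifier are equally likely."""
    if n_distinct_values < 1:
        raise ValueError("an identifier needs at least one possible value")
    return 1 / n_distinct_values


# DOB: roughly 30 years of possible birth dates (high cardinality).
u_dob = u_exact_match_uniform(30 * 365)

# Sex: two possible values (low cardinality).
u_sex = u_exact_match_uniform(2)

print(f"u for DOB: {u_dob:.6f}")
print(f"u for sex: {u_sex:.2f}")
```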
Match Weights¶
One of the key measures of evidence of a match between records is the match weight.
Deriving Match Weights from m and u¶
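The match weight is a measure of the relative size of $m$ and $u$:
$$
\begin{aligned}
M &= \log_2\left(\frac{\lambda}{1-\lambda}\right) + \log_2 K \\
  &= \log_2\left(\frac{\lambda}{1-\lambda}\right) + \log_2 m - \log_2 u
\end{aligned}
$$
where $\lambda$ is the probability that two random records match and $K = m/u$ is the Bayes factor.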
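A key assumption of the Fellegi-Sunter model is that observations from different columns/comparisons are independent of one another. This means that the Bayes factor for two records is the product of the Bayes factors for each column/comparison:
$$K_\textsf{features} = \prod_{i}^\textsf{features} K_i = \prod_{i}^\textsf{features} \frac{m_i}{u_i}$$
This, in turn, means that match weights are additive:
$$M_\textsf{obs} = M_\textsf{prior} + M_\textsf{features}$$
where $M_\textsf{prior} = \log_2\left(\frac{\lambda}{1-\lambda}\right)$ and $M_\textsf{features} = \sum_{i}^\textsf{features} M_i = \sum_{i}^\textsf{features} \log_2\left(\frac{m_i}{u_i}\right)$.
So, considering these properties, the total match weight for two observed records can be rewritten as:
$$
\begin{aligned}
M_\textsf{obs} &= \log_2\left(\frac{\lambda}{1-\lambda}\right) + \sum_{i}^\textsf{features}\log_2\left(\frac{m_i}{u_i}\right) \\
&= \log_2\left(\frac{\lambda}{1-\lambda}\right) + \log_2\left(\prod_{i}^\textsf{features}\frac{m_i}{u_i}\right)
\end{aligned}
$$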
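The additive structure of match weights can be sketched in Python. The sketch takes the prior match weight to be $\log_2(\lambda/(1-\lambda))$ and each feature's match weight to be $\log_2(m/u)$. The function names and all numeric values of $\lambda$, $m$ and $u$ are made up for illustration; they are not taken from Splink or from any trained model.

```python
import math


def prior_match_weight(lam: float) -> float:
    """Match weight of the prior: log2 of the prior odds of a match."""
    return math.log2(lam / (1 - lam))


def feature_match_weight(m: float, u: float) -> float:
    """Match weight of one feature: log2 of its Bayes factor m / u."""
    return math.log2(m / u)


def total_match_weight(lam: float, m_u_pairs: list[tuple[float, float]]) -> float:
    """Prior weight plus the sum of the per-feature weights."""
    return prior_match_weight(lam) + sum(
        feature_match_weight(m, u) for m, u in m_u_pairs
    )


# Hypothetical parameters: prior odds of 1 to 8, one feature that agrees
# (Bayes factor 8) and one that disagrees (Bayes factor 0.25).
lam = 1 / 9
m_u_pairs = [(0.8, 0.1), (0.1, 0.4)]

total = total_match_weight(lam, m_u_pairs)

# Independence means summing weights equals taking log2 of the product
# of the Bayes factors.
via_product = prior_match_weight(lam) + math.log2(
    math.prod(m / u for m, u in m_u_pairs)
)

print(f"total match weight: {total:.3f}")
print(f"via product of Bayes factors: {via_product:.3f}")
```

With these made-up values the weights are roughly -3, +3 and -2, so the disagreeing feature pulls the total below zero despite the agreeing one.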
Interpreting Match Weights¶
The match weight is the central metric showing how much evidence of a match is provided by each of the features in a model. This is most easily shown through Splink's Waterfall Chart:
- 1️⃣ are the two records being compared
- 2️⃣ is the match weight of the prior, $M_\textsf{prior} = \log_2\left(\frac{\lambda}{1-\lambda}\right)$. This is the match weight if no additional knowledge of features is taken into account, and can be thought of as similar to the y-intercept in a simple regression.
- 3️⃣ are the match weights of each feature in the model.
- 4️⃣ is the total match weight for two observed records, combining 2️⃣ and 3️⃣: $M_\textsf{obs} = M_\textsf{prior} + M_\textsf{features}$
- 5️⃣ is an axis representing the match weight, $\log_2(\textsf{Bayes factor})$
- 6️⃣ is an axis representing the equivalent match probability (noting the non-linear scale). For more on the relationship between match weight and probability, see the sections below.
Match Probability¶
Match probability is a more intuitive measure of similarity than match weight, and is generally used when choosing a similarity threshold for record matching.
Deriving Match Probability from Match Weight¶
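The probability of two records being a match can be derived from the total match weight, $M_\textsf{obs}$:
$$\Pr(\textsf{Match} \mid \textsf{Observation}) = \frac{2^{M_\textsf{obs}}}{1+2^{M_\textsf{obs}}}$$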
Example
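Consider the example in the Interpreting Match Weights section.
Converting its total match weight, $M_\textsf{obs}$, to a probability gives the match probability for that pair of records.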
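A minimal Python sketch of this conversion follows, taking the total match weight to be the base-2 logarithm of the odds of a match, so that the probability is $2^M / (1 + 2^M)$. The function names are hypothetical and are not Splink functions; the inverse is included because thresholds are often chosen as probabilities.

```python
import math


def match_weight_to_probability(match_weight: float) -> float:
    """Convert a match weight (log2 of the odds of a match) to a probability."""
    if match_weight >= 0:
        # Same value as 2^M / (1 + 2^M), arranged so 2^M cannot overflow.
        return 1.0 / (1.0 + 2.0 ** -match_weight)
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


def probability_to_match_weight(probability: float) -> float:
    """Inverse conversion: log2 of the odds implied by a probability."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return math.log2(probability / (1.0 - probability))


# A match weight of zero means even odds.
print(match_weight_to_probability(0.0))

# Match weight equivalent to a hypothetical probability threshold of 0.95.
threshold_weight = probability_to_match_weight(0.95)
print(f"{threshold_weight:.2f}")
```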
Understanding the relationship between Match Probability and Match Weight¶
It can be helpful to build up some intuition for how match weight translates into match probability.
Plotting match probability versus match weight gives the following chart:
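Some observations from this chart:
- A match weight of 0 corresponds to a match probability of 0.5
- A match weight of 2 corresponds to a match probability of 0.8
- A match weight of 3 corresponds to a match probability of about 0.89
- A match weight of 4 corresponds to a match probability of about 0.94
- A match weight of 7 corresponds to a match probability of about 0.99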
So, the impact of any additional match weight on match probability gets smaller as the total match weight increases. This makes intuitive sense as, when comparing two records, after you already have a lot of evidence/features indicating a match, adding more evidence/features will not have much of an impact on the probability of a match.
Similarly, if you already have a lot of negative evidence (features indicating a non-match), adding more negative evidence will not have much of an impact on the probability of a match, which is already close to zero.
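The diminishing effect of extra evidence can be checked numerically. The sketch below assumes the match probability is $2^M / (1 + 2^M)$ for a match weight $M$, and prints how much the probability rises for each additional unit of match weight. The function name is hypothetical.

```python
def match_probability(match_weight: float) -> float:
    """Probability implied by a match weight: 2^M / (1 + 2^M)."""
    return 1.0 / (1.0 + 2.0 ** -match_weight)


# Gain in probability from one extra unit of match weight, starting at 0.
start_weights = range(0, 8)
gains = [match_probability(w + 1) - match_probability(w) for w in start_weights]

for w, gain in zip(start_weights, gains):
    print(f"weight {w} -> {w + 1}: probability rises by {gain:.4f}")

# The curve is symmetric: a weight of -M gives one minus the probability at +M,
# so extra negative evidence has the same shrinking effect near zero.
mirror_gap = abs(match_probability(-3) - (1 - match_probability(3)))
```

Each successive gain is smaller than the one before it, which matches the flattening of the curve at both ends.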
Deriving Match Probability from m and u¶
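Given the definitions for match probability and match weight above, we can rewrite the probability in terms of $m$ and $u$:
$$
\begin{aligned}
\Pr(\textsf{Match} \mid \textsf{Observation}) &= \frac{2^{\log_2\left(\frac{\lambda}{1-\lambda}\right) + \log_2\left(\prod_{i}^\textsf{features}\frac{m_i}{u_i}\right)}}{1+2^{\log_2\left(\frac{\lambda}{1-\lambda}\right) + \log_2\left(\prod_{i}^\textsf{features}\frac{m_i}{u_i}\right)}} \\
&= \frac{\left(\frac{\lambda}{1-\lambda}\right)\prod_{i}^\textsf{features}\frac{m_i}{u_i}}{1+\left(\frac{\lambda}{1-\lambda}\right)\prod_{i}^\textsf{features}\frac{m_i}{u_i}} \\
&= 1 - \left[1+\left(\frac{\lambda}{1-\lambda}\right)\prod_{i}^\textsf{features}\frac{m_i}{u_i}\right]^{-1}
\end{aligned}
$$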
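As a sanity check, the direct route from $\lambda$, $m$ and $u$ can be compared with the route through the match weight. The Python sketch below computes the posterior odds as the prior odds $\lambda/(1-\lambda)$ times the product of the per-feature Bayes factors $m_i/u_i$, assuming the features are independent. The function names and parameter values are hypothetical.

```python
import math


def match_probability_from_m_u(lam: float, m_u_pairs: list[tuple[float, float]]) -> float:
    """Match probability computed directly from lambda and (m, u) pairs."""
    bayes_factor = math.prod(m / u for m, u in m_u_pairs)
    posterior_odds = lam / (1 - lam) * bayes_factor
    return 1 - 1 / (1 + posterior_odds)


def match_probability_from_weight(lam: float, m_u_pairs: list[tuple[float, float]]) -> float:
    """Same probability, computed via the total match weight."""
    weight = math.log2(lam / (1 - lam)) + sum(math.log2(m / u) for m, u in m_u_pairs)
    return 2.0 ** weight / (1 + 2.0 ** weight)


# Hypothetical parameters: prior odds of 1 to 8 and two features with
# Bayes factors of 8 and 0.25.
lam = 1 / 9
m_u_pairs = [(0.8, 0.1), (0.1, 0.4)]

direct = match_probability_from_m_u(lam, m_u_pairs)
via_weight = match_probability_from_weight(lam, m_u_pairs)

print(f"direct: {direct:.4f}, via match weight: {via_weight:.4f}")
```

The two routes agree up to floating-point rounding, as the algebra above requires.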
Further Reading¶
This academic paper provides a detailed mathematical description of the model used by the R fastLink package. The mathematics used by Splink is very similar.
[1] Cardinality is the number of items in a set. In record linkage, cardinality refers to the number of possible values a feature could have. This matters because the number of possible values of, for example, date of birth has a significant impact on how much evidence a match on date of birth provides for two records being a match.