Larsblog

Bayesian identity resolution

Previous
Next

Foggy morning, Oslo fjord

Stian Danenbarger has been telling me for a while about entity resolution (as he and many others call it), or identity resolution (as Wikipedia calls it). Basically, it's the process of working out which records/entities/objects actually represent the same real-world things by comparing their properties. Once Stian confirmed that Bayesian inferencing was a common method for this, I suddenly saw how you can actually do a poor man's version of this with just a little basic scripting.

Background

The context is that I'm extracting data from an ERP system, and I needed to see if there was significant overlap between the supplier table and the customer table. And, while I was at it, if there were duplicates within the tables. I started with in-table checking, since that was the easiest, and after two hours of Python programming I had something that spewed out reams and reams of duplicates in the data (and some false positives, admittedly). Since it was so easy to get useful results, I thought it was interesting enough to write about.

The basic idea is almost ridiculously simple: you pick a set of properties which provide clues for establishing identity. To compare two records you compare the properties pairwise and for each you conclude from the evidence in that property alone the probability that the two records represent the same real-world thing. Bayesian inference is then used to turn the set of probabilities from all the properties into a single probability for that pair of records. If the result is above a threshold you define, then you consider them duplicates.

The code

The suppliers table that I'm working with contains 41 columns, most of which are no use whatever to finding duplicates. For example, the time a record was last modified doesn't help, nor does the primary key. And so on. As far as I can see, only six columns are of any help here. So apart from the framework (and some data-cleaning functions), once I've imported the framework, this is all the Python code I need to check for duplicates:

THRESHOLD = 0.85
file = "suppliers-merged.csv"

relevant = [Property("NAME", 0.5, 0.9, name_clean, name_compare),
            Property("ASSOCIATION_NO", 0.1, 0.88, orgnr_clean),
            Property("ZIP_CODE", 0.4, 0.6, basic_clean),
            Property("COUNTRY", 0.1, 0.51, basic_clean),
            Property("ADDRESS1", 0.49, 0.75, basic_clean),
            Property("ADDRESS2", 0.49, 0.75, basic_clean)]

self_compare(file, THRESHOLD, relevant)

So, we set the probability threshold, give the file name, and define how we want to treat our six identity-discriminating properties. Let's walk through them one by one.

NAME is the name of the organization. Here we provide a function to normalize the names (name_clean), which strips out extra whitespace, changes "Ltd." into "Limited", lowercases the name, and so on. The first number says that if the names of two records are different, the probability that the organizations are the same is 0.5 (that is, 50/50, meaning we've learned nothing). The second says that if the names are the same the probability that they are the same is 90%. This is above our threshold of 0.85, so unless we later find evidence indicating that the organizations are different we will consider them duplicates.

ASSOCIATION_NO is the unique organization number assigned to the organization by the government. You'd think we could just do SQL directly on this column and go home, but it's not that simple. The numbers are formatted differently, not all suppliers have one, and in some cases different parts of a large organization (like the City of Oslo) share a single number. We provide a function to clean up the data, then specify our probabilities: if both numbers are provided and they are different, there's only a 10% chance the organizations are the same. However, if we get a match there's an 88% chance that they are the same.

So here you see immediately a benefit of the approach: let's say two organizations have the same name, but different organization numbers. That gives us 0.9 and 0.1 probability, which combines to 0.5. Having weighed the evidence, Bayes steps in and saves us from making what's probably a mistake. However, if two organizations have the same name and no organization number, we get 0.9 and 0.5, which combines to 0.9.

Anyway, ZIP_CODE is the next one. The zip code provides only a minor hint, but if the two suppliers have different codes the likelihood they are the same is slightly lower (0.4), and if they have different ones it is slightly higher (0.6). Let's say two records have the same name, no organization number, and different zip codes. This gives us 0.9, 0.5, and 0.4. This combines to 0.857, which is just above the threshold. If we find more counter-indicating evidence this pair is going to drop below the threshold.

COUNTRY is much the same. If the organizations are in different countries chances are very slim that they're the same one (although it could be a typing error), whereas if they are in the same country that barely signifies.

ADRESS1 and ADDRESS2 are much like the zip code, except differences here (often due simply to spelling or different abbreviations for "street" or "mailbox") don't signify much. However, if they are the same that's a fairly strong indicator.

Nobel Peace Center, Oslo

Example data

Let's look at how this works on real data. I'll pass over the cases where the organization number is the same, and look at ones were the other attributes do the job. The data have been slightly modified so that I'm not exposing customer data on the web.

Property Record 1 Record 2 Probability
Name acme inc acme inc 0.9
Assoc. no177477707   0.5
Zip code 9161 9161 0.6
Country norway norway 0.51
Address 1mb 113 mailbox 113 0.49
Address 2    0.5

This sums out to 0.93 probability, which is well above the threshold, as indeed it should be. You'll note that my cleaning of the address is not sophisticated enough, so it doesn't catch that ADDRESS1 is in fact the same. This is easy to fix.

The beauty of this approach is that it works astoundingly well for the amount of effort involved. After just two hours I find huge numbers of duplicates with low false positive rates. The main downside is that the pairwise comparison is slow for a table of 17,000 rows, but even this can be solved. And perhaps the best part is that once the basic (trivial) framework is up, doing another table takes just a few minutes.

Note that I'm not necessarily recommending that you do this yourself. There are real tools out there, both open source and commercial, which I'm sure do a vastly better job than a Python script cooked up in two hours, but I thought this was useful for illustrating the concept is, and how much you can get done with simple methods.

More advanced stuff

Of course, simply comparing the names as strings is not very sophisticated. What if one company is called "acme inc" and another simpy "acme"? This is what the name_compare function briefly referenced in the example takes care of. Essentially, it breaks up the names into tokens, then compares the sets of tokens. At the moment it does so fairly primitively, but it's quite effective. Unfortunately, it leads to some false positives at the moment. My next task is to figure out how to improve that.

This, by the way, is one area where the ready-made products shine. They understand misspellings, different transcriptions of the same names, matching of complex strings, and so on. So once you need these things (or performance) you may be better off with a ready-made product.

Update: In the end I built a proper entity resolution (or deduplication) engine in Java, now released as open source.







Comments

Ian Bashford - 2011-02-11 11:35:37

Interesting piece Lars. We used a similar technique in a British bank a few years ago to try to move them from an account based system to a customer based one. As years passed, customers were required to give more and more information to the bank when they opened new accounts. The same customer may be J Smith on one account, John Smith on another, John Simon Smith on a third, John Smith with DOB 20/03/71 on another etc. So trying to decide if they were all the same customer was hard...

In addition to the technique you describe above, we would allow a weighting for typos; so if the dates of birth were 20/03/71 and 20/03/77 we could still match them but with a slightly lower score. We also had to be careful of people who gave their children the same name as themselves. Two John Smiths at the same address with different birthdays might be father and son or just the same customer with a typo on one birthday.

orange - 2011-02-11 14:32:01

check out http://orange.biolab.si/ as a tool

mark - 2011-02-11 21:35:55

Just a thought for the name field, have you considered the soundex algorithm? It does a relatively good job at weeding out typos. Not perfect, but you could just down weight the relevance of its results.

Lars Marius - 2011-02-12 04:35:05

Ian: your handling of typos is interesting, and seems like a very good idea. I guess it illustrates how far you may need to go to squeeze the smallest hints out of the scarcity of available data. And also that this is all about probability at the end of the day. Even if the organization numbers match perfectly it may be simply because of a typo.

Mark: Yes, I did consider Soundex and similar algorithms, but so far all I've done is just some simple experimentation to judge the data quality and possible ways of dealing with it. I expect in the end we'll go with a tool that has fuzzy string matching built in. (And, yes, Soundex matches would be scored down, like you say.)

grove - 2011-02-13 11:25:36

JaroľWinkler distance and Levenshtein distance can be very useful for comparing the similarity of string values. Search engines use these to discover relevant information.

Lars Marius - 2011-02-13 13:45:16

grove: I did look at Levenshtein distance, but it seems to be purely a mathematical distance, not taking into account the likelihood of different modifications of a string. For example, in English substituting "c" for "s" is a lot more likely than substituting "k" for "s".

Jaro-Winkler distance (never heard of it before) seems to be the same.

But perhaps all this is all there is, and linguistic-based distance metrics are too difficult (or subjective) to design. It would in any case be interesting to try them both.

grove - 2011-02-14 04:15:37

Yes, they're mathematical distances. Perhaps they can be used with a phonetic algorithm like Metaphone and Soundex? http://en.wikipedia.org/wiki/Phonetic_algorithm

Heimo - 2011-02-20 14:34:35

Hi Lars-Marius. A good one again. We will need something like this internally at Tieto in our knowledge map exercise. Sure enough, information non-. And structured are often creted in silos even if related to the same subject. So far I thought we would need linguistic analysis approach using some more complex heuristic alorithms but lets try this approach first. Thnx Heimo

Scott - 2011-08-11 16:08:21

This is exactly what I need as an end-user. Are there any Window's based applications of this available?

Lars Marius - 2011-08-11 17:13:44

No, not really. There is a Java framework at http://code.google.com/p/duke/ but it does require some programming-like skills.

Marc Norlain - 2011-09-29 18:02:36

Just have a look at http://secondstring.sourceforge.net/ There is a good paper on string matching applied to entity names. Their combination of Jaro-Winkler and TFIDF is interesting.

Lars Marius - 2011-09-30 11:44:35

@Marc: Thanks for the tip. Have queued the paper for reading, and will look at the library, too.

Benjamin Hersey - 2012-01-19 19:06:18

Great blog, I am doing a bayesian project in graduate school and will be trying to use this process for my example, since I am a list manager at my dayjob.

It's all about the match!

Have you seen Oyster? http://identityresolutiondaily.com/870/oyster-a-configurable-er-engine/

Lars Marius - 2012-01-26 13:46:17

@Benjamin: Yes, I did look at Oyster, but the architecture seemed far too rigid for our uses, unfortunately.

Sergiy A - 2012-05-04 17:40:25

Lars,

I looked at the Duke source code to see how you combine the probabilities:

public static double computeBayes(double prob1, double prob2) { return (prob1 * prob2) / ((prob1 * prob2) + ((1.0 - prob1) * (1.0 - prob2))); }

Please explain how you get this formula because all your software is based on correctness of the formula. You mentioned that it is based Bayes but I do not see how the formula can be produced from Bayes theorem.

Lars Marius - 2012-05-05 05:08:12

@Sergiy: I've been asked this before, and tried to repeat the derivation of the formula. Sadly, I was not able to do it. It's possible that this is in fact *not* a correct use of Bayes's Theorem. It does work in practice, but I'd welcome suggestions on changing the formula.

AF - 2012-08-20 20:55:14

Hi, Very interesting, and along the lines of the other have you seen:

Emory University has an Open Source FRIL that I find very hard to use but applying the Genetic algo may be useful. I believe that from this the Free (but not open) US-CDC created Link Plus: http://fril.sourceforge.net/ http://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm

And of course Febrl from AU seems to be stale at this point as well, though they did get an update a year ago after about a 2 year hiatus but only in the code: http://sourceforge.net/projects/febrl/

I would love to create a fast FOSS best of breed MDM (Master Data Management). One of the most complete was MURAL in OpenESB from SUN based on their closed JAVA CAPS but the Oracle Purchase of Sun kind of killed all the goodness and light of Sun :) Pieces may be found: http://java.net/projects/mural http://openesb-dev.org/projects/mdm-legacy/

I will add duke to my list as well.

Lars Marius - 2012-08-21 02:38:25

Hi AF. I did look at those tools, but they were not good fits for our use case, I'm afraid. Febrl is Python, which is not suitable for the customer's infrastructure. FRIL looked better, but I couldn't see how to add my own data source, or how to run it on a server incrementally (meaning it runs continuously, looking for new data and updating the linkage as it goes).

Add a comment

Name required
Email optional, not published
URL optional, published
Comment
Spam don't check this if you want to be posted
Not spam do check this if you want to be posted

Last comments
RSS

Khaizarani Ibrahým bako on What is an informati...

Bruce on Equivalence classes

Stig on Bitcoin: promises an...

Jon Bjerkelien on The curse of NOARK

Lars Marius on Impressions from Str...

Aad Kamsteeg on Impressions from Str...

Lars Marius on The curse of NOARK

Dave Pawson on The curse of NOARK

james s. on 7 tips on writing cl...

Lars Marius on Active learning, alm...