Similarity in the Wild


Opportunity Topic Rank Matrix

Larger Image

Finding similarity across observations is one of the most common tasks/projects which a data scientist does. Collaborative Filtering purely depends on finding similar items(videos for Netflix, products for Amazon) for users. If you are doing a classification task with KNN(K Nearest Neighbor), you are classifying the new observations purely on the distance that you have in the training set. Most of the instance based learning algorithms in one way or another is built on the similarity distances of observations. Clustering algorithms (k-means, manifold learning) depends on the distance between observations.


Merriam Webster defines similarity as following:

a quality that makes one person or thing like another

So we want to find items that are similar to each other. But we need to first answer what an item is (Document representation) and how we will measure an item with other items(Distance Metric).

In order to measure the similarity between two observations, all of the observations should be defined in the same way(using a feature extraction method) to build a feature vector and a distance function which measures the distance between these feature vectors.

Document Representation or Feature Extraction

We have three different types in our observations(documents); tp, opp and companies. ‘tp’ stands for transaction profiles, ‘opp’ stands for opportunity profiles and ‘company’ stands for company(surprise!).

We are using a modified version of Topic-Sensitive Page-Rank to represent our documents regardless of their types. Not considering the types of the documents allow us to have same representation vectors that we could compare the documents regardless of their types.

Recently, we introduced Company Introduction feature for the tp owners to recommend companies that are registered to Axial. In order to do so, we need to find “similar companies” that are close to a given tp id. We have also boolean filters that we could use(we are filtering based on company type and industries in future), but after filtering, it pretty much depends on how similar a tp and a company.

Distance Metric

If feature extraction is an important step in any part of machine learning, the distance metric would be the second one. You could have the best feature vectors in the world, but if the distance metric you choose does not make a lot of sense for your feature set or the dimensions in the feature vectors, then the similarity would not make much sense.

For probability distributions, there are many ways to measure the distance(or similarity); l_p distances (l_1, l_2, Chebyshev), cosine, correlation, span-norm, Bhattacharyya, Hellinger and Jensen Shannon Divergence. Based on some experimentation, we decided to use Jensen Shannon Divergence(JSD) to measure the distance between documents.

Let’s talk about a little what JSD actually is.

Kullback-Leibler Divergence

Jensen Shannon Divergence is nothing but an average of two KL Divergence of two probability distributions with an average of the probability distributions. Its formula is in the following: KL(X || Y) = \displaystyle\sum_i X(i) ln \frac{X(i)}{Y(i)}. This is a nice way to measure the difference between a probability distribution comparing to Y which is a reference distribution. One way to reason about this distance metric is to assume two probability distributions are exactly the same. Then, ln \frac{X(i)}{Y(i)} would be zero. They are exactly same, so the distance is 0. Why ln, you may ask and that is related to information theory. KL Divergence is also called relative entropy, so one could think the KL divergence as how much information is gained from X assuming that Y is known. If they are same, information gain is zero.

Jensen-Shannon Divergence

KL Divergence is very nice in terms of what it measures, but it is not a metric that we could depend on. Why is that? It is hidden inside its asymmetric nature. KL(X || Y) \neq KL(Y || X) and that is a big problem as we cannot create a proper measure of between two observations without considering which is the reference and which one is the one that we measure the distance between the reference vector. In order to prevent this issue, there is a Symmetrised version(well sort of) which just sums up two different KL divergence between each other(KL(X || Y) + KL(Y || X) in order to reach a symmetrised version of KL Divergence, but we have another way to measure the distance as well, which is most probably obvious at this point.

Instead of looking at the distance between probability distributions to each other, what if we measure an average of them with every single of the probability distribution in order to have a symmetric distance metric.

JSD(X || Y) = \frac{1}{2} KL(X || A) + \frac{1}{2} KL(Q || A)
where A:
A = \frac{1}{2} (X+Y)
and it does not matter the order anymore:
JSD(X || Y) = JSD(Y || X)


For a single TP, we first filter out the ones that do not obey the “criteria”(boolean filtering) and then compute the JSD Divergence of a TP for the target documents. The companies that are closest to the TP are our candidates that we should introduce them to the TP owner.

We are using Xapian as our search engine; this is relatively straightforward to implement as an External Posting Source and if your colleague has already implemented it and what you need to do is just refactor and take all the credits writing a blog post of it.

I get that going for me