Industry Similarity via Jaccard Index

At Axial, we have a taxonomy tree for industries, and we want to know whether one particular industry is more similar to another. The similarity of some industries is straightforward if they share a parent, but this notion is not quantitative and does not produce a metric for how similar two industries are.

In the search page and other parts of the website, it would be useful to be able to compare two different industries whether or not they belong to the same parent node. For example, if a user chooses an industry during onboarding, we should be able to recommend another industry based on the selected one. This not only ensures that the user chooses consistent industry taxonomies but also exposes similar industries she may not know about. That part is something we planned to do, but let’s look at how this feature is in production right now.

Industry Ordering in In-App Search

In the in-app search page, when the user selects a particular industry in the industry facet search,
we want to order the industries of the documents (campaigns, projects and companies)
based on the selected industry. Since we do not limit the number of industries for projects,
ordering the industries of a project becomes quite handy when comparing against
the industry facet.

Similarity in General

For other parts of the website, we could measure the similarity between industries when we want to compare how similar two documents are in terms of the industries they are assigned to, or how good the match relationship between a campaign and a project is. Producing a similarity metric for industries gives a proxy for how similar two documents are.

Industry Similarity via Jaccard Index

In order to do so, we used the Jaccard index to measure similarity between industries based on
the campaign keywords that are associated with each industry. Let’s review what the Jaccard index is, and
then I will explain how it is used to measure similarity between two industries.

Jaccard Index

The Jaccard index is a statistic for comparing and measuring how similar
two sets are to each other. It is the ratio of the size of the intersection
of two sets to the size of their union:

jaccard(A, B) = \frac{|A \cap B|}{|A \cup B|}

If you have a finite number of representative elements for a particular
observation and you want to compare it with another
observation, you can count the number of items that are common to both
sets. It is a natural fit for comparing posts: if you know
the representative tags of two posts, the index measures how similar the two
articles are in terms of tags.

Its Python implementation is pretty trivial.

def jaccard_index(first_set, second_set):
    """Computes the Jaccard index of two sets.
    Arguments:
        first_set(set): first set to compare
        second_set(set): second set to compare
    Returns:
        index(float): Jaccard index of the two sets; it is between 0.0 and 1.0
    """
    # If both sets are empty, the Jaccard index is defined to be 1
    index = 1.0
    if first_set or second_set:
        index = float(len(first_set.intersection(second_set))) / len(first_set.union(second_set))
    return index

first_set = set(range(10))
second_set = set(range(5, 20))
index = jaccard_index(first_set, second_set)
print(index) # 0.25 as 5/20

Industry Similarity

I talked a little bit about our industry taxonomy earlier: we wanted not only to expose it in different ways but also to measure how similar industries are to each other. The problem is that we did not have a good way to compare two different industries. Since our taxonomy structure is a tree, we could group industries by their parent nodes, but we could not necessarily compare and measure how similar they are in a robust and reliable way.

Instead of coming up with a heuristic-based approach to measure similarity, we decided to use user data.
We have keywords for the campaigns our members created. When members create a campaign, they can enter a set of keywords along with the industries they choose. We decided to use this information to compare and measure similarity between industries using campaign keywords.

The idea is simple: if two industries have a lot of keywords in common across campaign profiles, chances are they are closely related. As more members choose similar keywords for those industries to represent their campaigns, the likelihood of similarity only increases.
By using campaign keywords, we not only reduce the dimensionality of the text (descriptions are generally much longer than campaign keywords) but also get feature selection for free, as the campaign keywords tend to be much more descriptive, dense and rich in information than the descriptions.

Industry Similarity by Jaccard Index

In order to build an industry similarity measure, we first assigned the campaign keywords to each
industry. Then, for a given pair of industries, we could compute the Jaccard index in a very straightforward manner. But what if we want to compare multiple industries against all of the industries we have in the database? We could still use the Jaccard index for multiple-industry comparison, even though it is not formally defined for more than two sets.

However, one can generalize the Jaccard index quite easily: since all we do is intersection and union operations across sets, we can extend both to multiple industries, as in the following:

jaccard(A_1, \ldots, A_n) = \frac{|A_1 \cap A_2 \cap \ldots \cap A_n|}{|A_1 \cup A_2 \cup \ldots \cup A_n|}

This is pretty neat. Note that the set order does not matter (icing on the cake).
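A minimal sketch of this multi-set generalization (the function name is illustrative):

```python
from functools import reduce

def jaccard_index_multi(sets):
    """Generalized Jaccard index: the intersection of all sets
    over the union of all sets. Defined as 1.0 when all sets are empty."""
    intersection = reduce(set.intersection, sets)
    union = reduce(set.union, sets)
    return float(len(intersection)) / len(union) if union else 1.0

print(jaccard_index_multi([{1, 2, 3}, {2, 3, 4}, {2, 3, 5}]))  # 0.4
```

Because intersection and union are both commutative and associative, reordering the input sets cannot change the result.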

What was available?

When we already know the intent of the user (the industry facet), it is relatively easy to put that industry in first place and let the remaining industries follow it, ordered by similarity.
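Concretely, the ordering could be sketched like this; `order_industries`, the industry names and the keyword sets are all illustrative stand-ins for our real data:

```python
def order_industries(doc_industries, selected, keywords_by_industry):
    """Order a document's industries so the selected facet industry comes
    first, followed by the rest sorted by Jaccard similarity to it."""
    selected_kw = keywords_by_industry[selected]

    def similarity(industry):
        kw = keywords_by_industry[industry]
        union = selected_kw | kw
        return len(selected_kw & kw) / float(len(union)) if union else 1.0

    return sorted(doc_industries,
                  key=lambda ind: (ind != selected, -similarity(ind)))

# Toy keyword sets, made up for illustration only.
keywords_by_industry = {
    "Aerospace & Defense": {"aircraft", "defense", "avionics"},
    "Industrials": {"manufacturing", "aircraft", "machinery"},
    "Software": {"saas", "cloud"},
}
print(order_industries(["Software", "Industrials", "Aerospace & Defense"],
                       "Aerospace & Defense", keywords_by_industry))
# ['Aerospace & Defense', 'Industrials', 'Software']
```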

When the user chose Aerospace & Defense, before industry ordering we displayed the industries of the documents in no particular order:

[Screenshot: document industries before ordering, Aerospace & Defense selected]

With industry ordering, we now sort the industries by similarity:

[Screenshot: document industries sorted by similarity, Aerospace & Defense selected]

Before industry ordering, a “wine” search in Distillers & Vintners:

[Screenshot: Distillers & Vintners results before industry ordering]

After industry ordering is in place:

[Screenshot: Distillers & Vintners results after industry ordering]

As mentioned before, this ordering is easy to extend to multiple industries as well:

[Screenshot: ordering with two industries selected]

Topic Modeling for Keyword-Phrase Extraction

[Figure: a document deconstructed into color-coded topics, from Blei’s LDA illustration]

We were frustrated about the visibility of our industry taxonomy to the user. If one of our members wanted to search Axial for a particular field, they needed to know the exact taxonomy name we use for that field. For example, if one wants to search for wood and wood products, they need to know that those fall under our “Forest Products” taxonomy, which is not obvious when a user wants to do a search on our website.

Not only does this limit the query capabilities of the user, but it also degrades our search results, as we do not know which industry they are interested in.

In order to tackle this problem, and to answer the questions listed below, we use topic modeling over a number of documents to extract topics and mine phrases, which provides better typeahead functionality to the user.

  • How do you extract phrases and keywords from a large number of documents?
  • How do you find recurring themes and topics in a corpus without using any metadata (labels, annotations)?
  • How do you cluster a number of documents efficiently, and make sure that the clusters correspond to coherent themes?

Topic modeling is an unsupervised learning and clustering method that enables us to do all of the above.

If you want to deconstruct a document based on the various themes it contains, as shown in the image above, topic modeling is a great tool for exploring topics and themes. In the image, every color corresponds to a particular theme, and each theme is made up of various words. But what does a topic look like?

Topics as Word Distributions

When you see the following words, what do you think:

wood pellet pellets energy biomass production tons renewable plant million fuel forest management heating development carbon facilities

If you think forest, wood or paper, you would be right. These are a subset of the words extracted from the Forest Products industry in opportunities that our members created.
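In other words, a topic is just a probability distribution over words. A toy sketch of what that looks like (the probabilities below are made up for illustration):

```python
# A truncated word distribution for a hypothetical "forest products" topic.
forest_products_topic = {
    "wood": 0.12, "pellet": 0.10, "biomass": 0.08, "energy": 0.07,
    "renewable": 0.05, "forest": 0.05, "fuel": 0.04,
}

def top_words(topic, n=3):
    """The n most probable words give a quick summary of a topic's theme."""
    return [w for w, _ in sorted(topic.items(), key=lambda kv: -kv[1])[:n]]

print(top_words(forest_products_topic))  # ['wood', 'pellet', 'biomass']
```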

Industry Aliasing

Previously, if our members wanted to search for a particular industry, they needed to know its exact name in order to see the typeahead match in the search bar. We do matching via Named Entity Recognition in Query (NERQ), but it was limited to exact keyword matches on industries.

For example, if they wanted to do a search related to the “wine” industry, they needed to know that our taxonomy entry corresponding to that industry is “Distillers and Vintners”. Or, if they wanted to do a general search related to “shoes”, they needed to know that we have a “Footwear” industry.

In order to remedy this problem, we expanded our industry matching to a larger number of words so that we could match “related” and “relevant” keywords to our taxonomies. When a user types in “wine”, we are able to match that keyword to our related taxonomy of “Distillers and Vintners”.
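At lookup time, this expanded matching can be as simple as a dictionary from alias keywords to taxonomy names; the table and helper below are a tiny illustrative slice, not our actual mapping:

```python
# Illustrative alias entries; the real table is far larger.
INDUSTRY_ALIASES = {
    "wine": "Distillers and Vintners",
    "vineyard": "Distillers and Vintners",
    "shoes": "Footwear",
    "sneakers": "Footwear",
}

def match_industry(query):
    """Map a typed keyword to its taxonomy industry, if we have an alias."""
    return INDUSTRY_ALIASES.get(query.strip().lower())

print(match_industry("Wine"))  # Distillers and Vintners
```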

Topic Modeling for Keyword Extraction

We used topic modeling for keyword and phrase extraction on user-generated documents that are classified by industry. This provides three main benefits. First, all of the keywords are data-driven and human-generated. Second, since every document is associated with various industries, we do not need to associate the documents with each topic one by one; we can mine the keywords and phrases per industry. Last but not least, we can use the industry information as input to our topic-sensitive ranking algorithm to improve search precision.

We created a set of keywords/phrases (around 4000) to expand the matching between what a user types and which industry it will match. Since most of the keywords and phrases are descriptive of the industry itself, they would be intuitive to a user.

Topic Model

The grouping of relevant words is highly suggestive of an abstract theme which is called a topic. Based on the assumption that words that are in the same topic are more likely to occur together, it is possible to attribute phrases or keywords to a particular topic. This allows us to alias a particular topic with a number of phrases and words.

Not all words are created equal

As we are more interested in thematic and somewhat specific topics, we are not interested in words that do not contribute much to any topic. The usual suspects are articles (a, an, the), pronouns (I, you, she, he, we, …), prepositions (in, under, of, …), and also common adverbs and, more often than not, verbs.

Oh, also the adjectives:

When you catch an adjective, kill it. No, I don’t mean utterly, but kill most of them–then the rest will be valuable. They weaken when they are close together. They give strength when they are far apart. — Mark Twain

Not only do they not contribute to the topics/themes at all, but they also disrupt the word distributions within each topic. For these reasons, common words should be removed prior to topic modeling; this is the first rule of thumb. We also removed rare words, those occurring fewer than 3 times in the corpus, with the understanding that rare words do not materially contribute to topic distinction. This provides two additional benefits: first, we do not have to deal with a large vocabulary, as word distributions in corpora usually have long tails; second, we avoid unnecessary computation on words classified as unimportant.
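A sketch of this preprocessing, assuming whitespace tokenization and a toy stop-word list (the real list is much longer):

```python
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "in", "under", "i", "you", "she", "he", "we"}

def preprocess(docs, min_count=3):
    """Drop stop words, then drop words occurring fewer than min_count
    times across the whole corpus."""
    tokenized = [[w for w in doc.lower().split() if w not in STOPWORDS]
                 for doc in docs]
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in tokenized]

docs = ["the wood pellet plant", "wood pellet energy", "wood pellet fuel"]
print(preprocess(docs))
# [['wood', 'pellet'], ['wood', 'pellet'], ['wood', 'pellet']]
```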

Unsupervised Nature of Topic Models

Topic models are unsupervised, i.e. they do not require any prior information about the documents of the corpus, such as descriptive labels or other classifications. They infer various topics and themes operating purely on the documents, which makes them a very powerful and useful tool for quickly exploring the themes in a large corpus. For example, if you are searching for documents about one particular theme, e.g. “internet of things”, you want to get the documents that are about that theme (increasing recall) rather than only the documents that include the exact phrase “internet of things”.

Industry Aliasing

By doing so, we created a set of around 4,000 keywords/phrases and mapped them to our roughly 200 industries, so that when you type “wine” in the search bar, you get the “Distillers and Vintners” industry (yeah, it is hard to guess).

[Screenshot: typing “wine” matches the Distillers and Vintners industry]

Or, when you type “search engine” in search (so meta):

[Screenshot: typeahead results for “search engine”]

[Screenshot: typeahead results for “data science”]

Some more:

[Screenshot: more typeahead examples]

Adjectives are not so bad

Remember the adjectives and how useless they are for topic modeling? They can still come in handy in conversation:

A man’s character may be learned from the adjectives which he habitually uses in conversation. — Mark Twain

Similarity in the Wild

[Figure: Opportunity Topic Rank Matrix]

Finding similarity across observations is one of the most common tasks a data scientist faces. Collaborative filtering depends entirely on finding similar items (videos for Netflix, products for Amazon) for users. If you are doing a classification task with KNN (K Nearest Neighbors), you classify new observations purely by their distance to the observations in the training set. Most instance-based learning algorithms are, in one way or another, built on the similarity distances between observations. Clustering algorithms (k-means, manifold learning) depend on the distance between observations as well.

Similarity

Merriam-Webster defines similarity as follows:

a quality that makes one person or thing like another

So we want to find items that are similar to each other. But first we need to answer what an item is (document representation) and how we will measure one item against others (distance metric).

In order to measure the similarity between two observations, all observations should be defined in the same way (using a feature extraction method) to build a feature vector, and we need a distance function that measures the distance between these feature vectors.

Document Representation or Feature Extraction

We have three different document types among our observations: tp, opp and company. “tp” stands for transaction profile, “opp” stands for opportunity profile and “company” stands for company (surprise!).

We are using a modified version of Topic-Sensitive PageRank to represent our documents regardless of their types. Not considering the types of the documents allows us to have representation vectors in the same space, so we can compare documents regardless of their types.

Recently, we introduced a Company Introduction feature that recommends to tp owners companies that are registered on Axial. In order to do so, we need to find “similar companies” that are close to a given tp id. We also have boolean filters we can use (we filter based on company type, and on industries in the future), but after filtering, it pretty much comes down to how similar a tp and a company are.

Distance Metric

If feature extraction is the most important step in any part of machine learning, choosing the distance metric would be the second. You could have the best feature vectors in the world, but if the distance metric you choose does not make sense for your feature set or for the dimensions in your feature vectors, then the similarities will not make much sense either.

For probability distributions, there are many ways to measure the distance (or similarity): l_p distances (l_1, l_2, Chebyshev), cosine, correlation, span-norm, Bhattacharyya, Hellinger and Jensen-Shannon divergence. Based on some experimentation, we decided to use Jensen-Shannon divergence (JSD) to measure the distance between documents.

Let’s talk a little about what JSD actually is.

Kullback-Leibler Divergence

Jensen-Shannon divergence is nothing but the average of the KL divergences of two probability distributions from the average of those distributions. The KL divergence formula is the following: KL(X || Y) = \displaystyle\sum_i X(i) \ln \frac{X(i)}{Y(i)}. This is a nice way to measure the difference of a probability distribution X from a reference distribution Y. One way to reason about this metric is to assume the two probability distributions are exactly the same: then \ln \frac{X(i)}{Y(i)} would be zero everywhere. They are exactly the same, so the distance is 0. Why \ln, you may ask, and that is related to information theory. KL divergence is also called relative entropy, so one could think of the KL divergence as how much information is gained from X assuming that Y is already known. If they are the same, the information gain is zero.

Jensen-Shannon Divergence

KL divergence is very nice in terms of what it measures, but it is not a metric we can depend on. Why is that? The problem is hidden in its asymmetric nature: KL(X || Y) \neq KL(Y || X), and that is a big problem, as we cannot create a proper measure between two observations without deciding which one is the reference and which one we measure the distance from. To work around this issue, there is a symmetrized version (well, sort of), which simply sums the two KL divergences (KL(X || Y) + KL(Y || X)). But there is another way to measure the distance as well, which is probably obvious at this point.

Instead of measuring the distance between the probability distributions themselves, what if we measured the distance of each distribution from their average, in order to get a symmetric distance metric?

JSD(X || Y) = \frac{1}{2} KL(X || A) + \frac{1}{2} KL(Y || A)
where:
A = \frac{1}{2} (X+Y)
and the order does not matter anymore:
JSD(X || Y) = JSD(Y || X)
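A direct translation of these formulas into Python (a sketch over discrete distributions, not our production code; zero entries of the first argument are skipped, since 0 · ln 0 is taken to be 0):

```python
from math import log

def kl_divergence(x, y):
    """KL(X || Y) = sum_i X(i) * ln(X(i) / Y(i))."""
    return sum(xi * log(xi / yi) for xi, yi in zip(x, y) if xi > 0)

def jsd(x, y):
    """Jensen-Shannon divergence: the average KL divergence of each
    distribution from their average; symmetric in x and y."""
    a = [(xi + yi) / 2.0 for xi, yi in zip(x, y)]
    return 0.5 * kl_divergence(x, a) + 0.5 * kl_divergence(y, a)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
assert jsd(p, q) == jsd(q, p)  # order does not matter
assert jsd(p, p) == 0.0        # identical distributions have zero distance
```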

Implementation

For a single TP, we first filter out the documents that do not satisfy the criteria (boolean filtering), and then compute the JSD divergence between the TP and the target documents. The companies that are closest to the TP are the candidates we should introduce to the TP owner.
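The filter-then-rank step can be sketched as below; `rank_candidates`, its parameters and the toy data are illustrative, with the boolean filter and the distance function (e.g. JSD) passed in:

```python
def rank_candidates(tp_vector, candidates, distance, passes_filters):
    """Apply boolean filters, then sort candidates by distance to the
    TP's representation vector, closest first."""
    filtered = [(cid, vec) for cid, vec in candidates.items()
                if passes_filters(cid)]
    return [cid for cid, vec in
            sorted(filtered, key=lambda item: distance(tp_vector, item[1]))]

# Toy example using an l_1 distance; vectors and ids are made up.
l1 = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))
candidates = {"company-a": [0.5, 0.5], "company-b": [0.9, 0.1]}
print(rank_candidates([0.8, 0.2], candidates, l1, lambda cid: True))
# ['company-b', 'company-a']
```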

We are using Xapian as our search engine; this is relatively straightforward to implement as an External Posting Source, especially if your colleague has already implemented it and all you need to do is refactor it and take all the credit by writing a blog post about it.

I get that going for me