Implementing similarity measures in python: Cosine Similarity versus Jaccard Similarity

big

Jaccard similarity and cosine similarity are two very common measurements while comparing item similarities and today, Similarity measures are used in various ways, examples include in plagiarism, asking a similar question that has been asked before on Quora, collaborative filtering in recommendation systems, etc. I’ll be implementing and comparing both measures in different cases. Sounds interesting? Jump on the ride!

Hopefully, the first paragraph has given you a basic understanding of similarity but in the official terms, Similarity is the measure of how much alike two data objects are. A similarity in a data mining context is usually described as a distance with dimensions representing features of the objects. If this distance is small, there will be a high degree of similarity; if a distance is large, there will be a low degree of similarity.

Similarity is subjective and is highly dependent on the domain and application. For example, two fruits are similar because of color or size or taste.

Screen Shot 2017-07-31 at 10.40.07 PM

Ever wondered how Quora was able to suggest similar questions in this image? Let’s get started.

Using Quora data sets

Screen Shot 2017-08-01 at 10.23.43 AM

I’d like to believe the labeled column explains itself by their naming but is_duplicate is our binary target variable.

Similarity is measured in the range 0 to 1 [0,1].

  • Similarity = 1 if X = Y         (Where X, Y are two objects)
  • Similarity = 0 if X ≠ Y

Case Study:

text1 = 'How can I be a good geologist?' 
text2 = 'What should I do to be a great geologist?'

Moving forward, kindly note that machine learning models don’t work together with plain English and our data has to be translated to a language it can understand which are mathematical representations of this data. In our case study, we have two strings: text1 and text2 which are going to be represented in mathematical values for our tutorial in just 2 steps:

  1. Combining the two strings into a word vector space which is: [ HOW, CAN, I, BE, A, GEOLOGIST, WHAT, SHOULD, DO, TO]. This is simply making a set(unique values) after combining all the words in both strings.
  2. Document a matrix value based on the occurrence of each text against its’ set of vector space that we defined in step 1. Here’s an illustration below:

Screen Shot 2017-08-02 at 11.42.34 PM

Using S1 in this illustration above, you’ll agree with me that *HOW CAN I BE A GEOLOGIST* appeared in text1 but *WHAT SHOULD DO TO* doesn’t exist. Same principle to how we arrived at our S2.

Finally, we have our strings represented in a document matrix as seen below:

 text1 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0] 
 text2 = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

Voila, what we just did is translating our gibberish English to mathematical representations that our model will easily understand. In the next paragraph, I’d explain what the terms Jaccard similarity and cosine similarity means after which we’ll explore using both measures to determine if the two texts are the same.

The code for the illustration can be found here: https://github.com/andela-ysanni/document_to_matrix


 Cosine Similarity

In Physics, a vector has two things; magnitude and direction which can be written as

A vector is simply a quantity that has both magnitude and direction. In Computer Science, a 1-dimensional array is called a vector. Now we know what a vector is but how does it relate to Cosine Similarity. In a nutshell, Cosine Similarity is a measure that calculates the cosine of the angle between them.

cosine

In order to find the angle between the two vectors, we need to find the dot product of the two vectors as the formula besides the figure above.

Cosine similarity metric finds the normalized dot product of the two attributes. By determining the cosine similarity, we will effectively try to find the cosine of the angle between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1]. One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors.

Implementing our case study with Cosine Similarity

Screen Shot 2017-08-03 at 12.24.23 AM

S1 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0] 
S2 = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]


S1.S2 is the dot product of the matrices which results to 4.
|S1| is the square root of the magnitude of the values in S1.
|S2| is the square root of the magnitude of the values in S2.

Cosine similarity as seen above = 0.5774(in 4 decimal point)

Jaccard similarity

Jaccard Similarity also known as Jaccard index is a measure to find similarities between sets. So first, let’s learn the very basics of sets.

Sets: A set is (unordered) collection of objects {a,b,c}. we use the notation as elements separated by commas inside curly brackets { }. They are unordered so {a,b} = {b,a}.

Cardinality: The Cardinality of A (denoted by |A|) counts how many elements are in A.

Intersection: An intersection between two sets A and B is denoted A ∩ B and reveals all items which are in both sets A, B.

Union: Union between two sets A and B is denoted A ∪ B and reveals all items which are in either set.

 A = [1,2,3,4], B = [1,2,6,9]
 A∩B = [1,2] and A∪B = [1,2,3,4,6,9]
 |A| = 4 (lengths of element in A)

Now going back to Jaccard similarity.The Jaccard similarity measures similarity between finite sample sets and is defined as the cardinality of the intersection of sets divided by the cardinality of the union of the sample sets. Suppose you want to find Jaccard similarity between two sets A and B it is the ratio of the cardinality of A ∩ B and A ∪ B.

Using our code snippet from above, it means:

 Jaccard Similarity = |A ∩ B| : |A ∪ B|
                    = length_of(A ∩ B) : length_of(A ∪ B)
                    = 2:6 (: also means division)
                    = 0.33

Implementing our two texts with Jaccard Similarity

With everything that has been explained about Jaccard similarity, we will now implement its formulae with our mathematical values.

 A = "How can I be a geologist?"
 B = "What should I do to be a geologist?"
 A ∩ B = ['a', 'be', 'geologist?', 'i']
 A U B = ['a', 'be', 'geologist?', 'to', 'do', 'i', 'what', 'should', 'how', 'can']
|A ∩ B| = 4 (Cardinality of the elements in the intersection of A and B)
|A ∪ B| = 10 (Cardinality of the elements in the union of A and B)

Jaccard Similarity = |A ∩ B| : |A U B|
                   = 4:10
                   = 0.40

The final answer we arrived at is 0.4 which is not even close to the target variable our data set tells us.


Pre-Conclusion

Based on the two texts strings we used as a case study, there are 3 known facts:

  • The value of our target variable from Quora says it’s 1.
  • Jaccard similarity results into 0.40
  • Cosine similarity results into 0.5774

Comparing the results of our case study from Jaccard similarity and Cosine similarity, Cosine similarity has a better score which is closer to our target variable. I also used few examples from the Quora data and found out cosine did a great job than Jaccard. If we also decide to approximated all values that are above 0.5 to 1 and less than 0.5 to 0, We can conclude cosine works better than Jaccard.

In Summary

We can’t for sure say cosine similarity is a bad measure similarity because it couldn’t predict the exact right value but we can see it’s close to 1. Cosine similarity is for comparing two real-valued vectors, but Jaccard similarity is for comparing two binary vectors (sets).  Simply put; in cosine similarity, the number of common attributes is divided by the total number of possible attributes. Whereas in Jaccard Similarity, the number of common attributes is divided by the number of attributes that exist in at least one of the two objects.

There are many other measures of similarity, each with its own eccentricities. When deciding which one to use, try to think of a few representative cases and work out which index would give the most usable results to achieve your objective.

The Cosine similarity could be used to identify plagiarism, but will not be a good index to identify mirror sites on the internet. Whereas the Jaccard similarity will be a good index to identify mirror sites, but not so great at catching copy pasta plagiarism (within a larger document).

When applying these indices, you must think about your problem thoroughly and figure out how to define similarity. Once you have a definition in mind, you can go about shopping for an index. We currently have several similarity measures like Euclidean distance, Manhattan distance, etc.

Implementation of both similarity index in Python can be found here.

Thanks for reading. Got any questions or constructive criticism? I’ll be in the comments section. :)

You may also like

Leave a Reply

Your email address will not be published. Required fields are marked *