Project description
Given a database of important keywords and a database of fewer than 110 million publications, the task is to identify every occurrence of the keywords in the publications and store this information in an SQL database. The Aho-Corasick algorithm is used to match the keywords against the publication text; the main challenges are punctuation, upper- and lower-case differences, multi-word keywords, etc.
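The matching step can be illustrated with a minimal pure-Python sketch of Aho-Corasick. It is not the project's actual implementation: the function names (`normalize`, `build_automaton`, `find_keywords`), the normalization rules (lowercasing, turning punctuation into spaces) and the whole-word check are assumptions chosen to address the challenges listed above, and the SQL storage step is left out.

```python
import re
from collections import deque


def normalize(text):
    # Assumed normalization: lowercase, punctuation -> space, collapse whitespace.
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def build_automaton(keywords):
    # Trie stored as parallel lists: transitions, failure links, outputs.
    goto, fail, out = [{}], [0], [[]]
    for kw in keywords:
        norm = normalize(kw)
        if not norm:
            continue  # skip keywords that are empty after normalization
        node = 0
        for ch in norm:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append((kw, len(norm)))
    # Breadth-first pass to compute failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_keywords(text, automaton):
    # Return the set of keywords that occur in text as whole words.
    goto, fail, out = automaton
    s = normalize(text)
    node, found = 0, set()
    for i, ch in enumerate(s):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for kw, length in out[node]:
            start = i - length + 1
            left_ok = start == 0 or s[start - 1] == " "
            right_ok = i + 1 == len(s) or s[i + 1] == " "
            if left_ok and right_ok:
                found.add(kw)
    return found


automaton = build_automaton(["machine learning", "DNA", "data"])
hits = find_keywords("Machine-learning methods for DNA, metadata and databases.", automaton)
```

In this example `hits` contains "machine learning" and "DNA" but not "data", because "metadata" and "databases" fail the whole-word check. The automaton is built once and each publication is then scanned in a single pass, which is what makes the approach attractive at this scale.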
Secondly, related keywords are identified using the MinHash technique. MinHash (the min-wise independent permutations locality-sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The similarity measure it estimates is the Jaccard similarity coefficient, a commonly used indicator of the similarity between two sets: for subsets A and B of a set U, the Jaccard index J(A,B) is the ratio of the number of elements in their intersection to the number of elements in their union, J(A,B) = |A ∩ B| / |A ∪ B|. The goal of MinHash is to estimate J(A,B) quickly, without explicitly computing the intersection and union.
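A minimal sketch of the idea follows. It assumes, hypothetically, that each keyword is represented by the set of publication IDs in which it occurs; the project may build its sets differently. The random linear hash functions modulo a prime stand in for true random permutations, and all names and the number of hash functions are illustrative choices.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # large prime modulus for the linear hash functions


def jaccard(a, b):
    # Exact Jaccard index, for comparison with the estimate.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_params(num_hashes, seed=0):
    # One (a, b) pair per hash function h(x) = (a * x + b) mod PRIME.
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    # Signature = minimum hash value of the (non-empty) set under each function.
    xs = [zlib.crc32(str(x).encode("utf-8")) for x in set(items)]
    return [min((a * x + b) % PRIME for x in xs) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    # Fraction of hash functions on which the two minima agree.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


params = make_hash_params(512)
pubs_a = {"p1", "p2", "p3", "p4"}  # hypothetical publication IDs for keyword A
pubs_b = {"p2", "p3", "p4", "p5"}  # hypothetical publication IDs for keyword B
exact = jaccard(pubs_a, pubs_b)
approx = estimate_jaccard(minhash_signature(pubs_a, params),
                          minhash_signature(pubs_b, params))
```

Here the exact index is 3/5, and `approx` should land close to it. The signatures have a fixed length regardless of set size, so once they are computed, any two keywords can be compared without touching the underlying sets again.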