
# Python cosine-similarity on all possible pairs in list

By : Asparatame
Date : November 22 2020, 02:42 PM
Hope this helps fix the issue. You can use Python's `groupby` and `combinations` functions from `itertools` as follows:
code :
``````from itertools import groupby, combinations
import math

def cosine_similarity(v1, v2):
    # Accumulate the dot product and both squared norms in one pass
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    return sumxy / math.sqrt(sumxx * sumyy)

info_list = [
    ('188.74.64.243', '1', [0, 1, 1, 0]),
    ('99.229.98.18',  '1', [0, 1, 1, 1]),
    ('86.41.253.102', '1', [1, 1, 1, 1]),
    ('188.74.64.243', '2', [0, 1, 1, 0]),
    ('99.229.98.18',  '2', [0, 1, 1, 1]),
    ('86.41.253.102', '2', [1, 1, 1, 1]),
]

# Group by the second field, then compare every pair within each group
for k, g in groupby(info_list, key=lambda x: x[1]):
    for x, y in combinations(g, 2):
        print(x[0], y[0], x[1], x[2], y[2], cosine_similarity(x[2], y[2]))
    print()
``````
``````188.74.64.243 99.229.98.18 1 [0, 1, 1, 0] [0, 1, 1, 1] 0.8164965809277261
188.74.64.243 86.41.253.102 1 [0, 1, 1, 0] [1, 1, 1, 1] 0.7071067811865475
99.229.98.18 86.41.253.102 1 [0, 1, 1, 1] [1, 1, 1, 1] 0.8660254037844387

188.74.64.243 99.229.98.18 2 [0, 1, 1, 0] [0, 1, 1, 1] 0.8164965809277261
188.74.64.243 86.41.253.102 2 [0, 1, 1, 0] [1, 1, 1, 1] 0.7071067811865475
99.229.98.18 86.41.253.102 2 [0, 1, 1, 1] [1, 1, 1, 1] 0.8660254037844387
``````
Note that `groupby` only groups consecutive items with equal keys, so if `info_list` is not already ordered by that field, sort it first:
``````for k, g in groupby(sorted(info_list, key=lambda x: x[1]), key=lambda x: x[1]):
``````
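Since `groupby` only merges runs of equal keys, an unsorted list silently produces fragmented groups. A minimal sketch of the difference, on made-up data:

```python
from itertools import groupby

# A list where key '1' appears in two separate runs
items = [('a', '1'), ('b', '2'), ('c', '1')]

# Without sorting, groupby yields three groups, not two
unsorted_keys = [k for k, _ in groupby(items, key=lambda x: x[1])]
print(unsorted_keys)  # ['1', '2', '1']

# Sorting by the key first merges the runs into two groups
sorted_keys = [k for k, _ in groupby(sorted(items, key=lambda x: x[1]),
                                     key=lambda x: x[1])]
print(sorted_keys)  # ['1', '2']
```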


## cosine-similarity between consecutive pairs using whole articles in JSON file

By : Devon Dorrity
Date : March 29 2020, 07:55 AM
Hopefully this helps fix your problem. I think, based on our discussion above, you need to change the foo function and everything below it. See the code below; note that I haven't actually run it, since I don't have your data and no sample lines were provided.
code :
``````## Load the packages needed
import nltk, string
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

with open('SDM_2015.json') as f:
    data = [json.loads(line) for line in f]

## Define the functions used to filter the data

# Stemmer that reduces each word to its common root
stemmer = nltk.stem.porter.PorterStemmer()

# Translation table for stripping punctuation
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

## First function: stems a list of tokens
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

## Second function, built on the first: lower-cases the text, removes punctuation (using the map above), tokenises, and stems
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

## TF-IDF
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf_data = vectorizer.fit_transform(data)

# Cosine similarities
similarity_matrix = cosine_similarity(tfidf_data)
``````
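One caveat: `fit_transform` expects an iterable of strings, so the raw list of JSON records has to be reduced to their text fields first. A minimal sketch, assuming each record carries a text field (hypothetically named `'abstract'` here; substitute whatever key your file actually uses):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for the records loaded from the JSON file
data = [
    {'title': 'Paper A', 'abstract': 'graph mining on large networks'},
    {'title': 'Paper B', 'abstract': 'mining frequent patterns in graphs'},
]

# Pull out just the text field before vectorising
texts = [record['abstract'] for record in data]

vectorizer = TfidfVectorizer()
tfidf_data = vectorizer.fit_transform(texts)
similarity_matrix = cosine_similarity(tfidf_data)
print(similarity_matrix.shape)  # (2, 2)
```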

## Calculate cosine similarity of all possible text pairs retrieved from 4 mysql tables

By : xincheng0125
Date : March 29 2020, 07:55 AM
This seems to work fine. The following is a minimal example that calculates the pairwise cosine similarities between a set of documents (assuming you have successfully retrieved the title and text from your database).
code :
``````from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assume this is the data we have (4 short documents)
data = [
    'I like beer and pizza',
    'I love pizza and pasta',
    'I prefer wine over beer',
    'Thou shalt not pass'
]

# Vectorise the data
vec = TfidfVectorizer()
X = vec.fit_transform(data)  # `X` is now a TF-IDF representation of the data; the first row of `X` corresponds to the first sentence in `data`

# Calculate the pairwise cosine similarities (depending on the amount of data you have, this could take a while)
S = cosine_similarity(X)

'''
S looks as follows:
array([[ 1.        ,  0.4078538 ,  0.19297924,  0.        ],
       [ 0.4078538 ,  1.        ,  0.        ,  0.        ],
       [ 0.19297924,  0.        ,  1.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.        ]])

The first row of `S` contains the cosine similarities of the first sentence to every other element in `X`.
For example, the cosine similarity of the first sentence to the third sentence is ~0.193.
Obviously the similarity of every sentence/document to itself is 1 (hence the diagonal of the similarity matrix is all ones).
Since all indices are consistent, it is straightforward to map the similarities back to the corresponding sentences.
'''
``````
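To go from the similarity matrix back to concrete pairs, one simple option is to zero the diagonal and take the argmax; a small sketch built on the same example data:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

data = [
    'I like beer and pizza',
    'I love pizza and pasta',
    'I prefer wine over beer',
    'Thou shalt not pass'
]
S = cosine_similarity(TfidfVectorizer().fit_transform(data))

# Ignore self-similarity on the diagonal, then find the best-matching pair
np.fill_diagonal(S, 0.0)
i, j = np.unravel_index(np.argmax(S), S.shape)
print(data[i], '<->', data[j])  # the two pizza sentences
```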

## Python, Cosine Similarity to Adjusted Cosine Similarity

By : SL3
Date : March 29 2020, 07:55 AM
Here's a NumPy-based solution to your problem (it also uses `pdist` and `squareform` from `scipy.spatial.distance`).
First, store the rating data in an array:
code :
``````import numpy as np
from scipy.spatial.distance import pdist, squareform

fruits = np.asarray(['Apple', 'Orange', 'Pear', 'Grape', 'Melon'])
M = np.asarray(data.loc[:, fruits])
``````
``````# Subtract each user's mean rating, then take cosine similarity between item columns
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
``````
``````# For each fruit, rank the other fruits by descending similarity
indices = np.fliplr(np.argsort(similarity_matrix, axis=1)[:, :-1])
result = np.hstack((fruits[:, None], fruits[indices]))
``````
``````In [49]: M
Out[49]:
array([[ 0, 10,  0,  1,  0],
       [ 6,  0,  0,  0,  2],
       [ 1,  0, 20,  0,  1],
       [ 0,  3,  6,  0, 18],
       [ 3,  0,  2,  0,  0],
       [ 0,  2,  0,  5,  0]])

In [50]: np.set_printoptions(precision=2)

In [51]: similarity_matrix
Out[51]:
array([[ 1.  ,  0.01, -0.41,  0.48, -0.44],
       [ 0.01,  1.  , -0.57,  0.37, -0.26],
       [-0.41, -0.57,  1.  , -0.56, -0.19],
       [ 0.48,  0.37, -0.56,  1.  , -0.51],
       [-0.44, -0.26, -0.19, -0.51,  1.  ]])

In [52]: result
Out[52]:
array([['Apple', 'Grape', 'Orange', 'Pear', 'Melon'],
       ['Orange', 'Grape', 'Apple', 'Melon', 'Pear'],
       ['Pear', 'Melon', 'Apple', 'Grape', 'Orange'],
       ['Grape', 'Apple', 'Orange', 'Melon', 'Pear'],
       ['Melon', 'Pear', 'Orange', 'Apple', 'Grape']],
      dtype='|S6')
``````
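The snippets above assume the rating frame `data` already exists; a self-contained sketch of the adjusted-cosine step, hard-coding the same rating matrix shown in the session:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# The user-by-item rating matrix from the session above (rows = users, columns = items)
M = np.array([[0, 10,  0, 1,  0],
              [6,  0,  0, 0,  2],
              [1,  0, 20, 0,  1],
              [0,  3,  6, 0, 18],
              [3,  0,  2, 0,  0],
              [0,  2,  0, 5,  0]], dtype=float)

# Subtract each user's mean rating, then take cosine similarity between item columns
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))

print(np.round(similarity_matrix, 2))
```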

## Cosine similarity for already known pairs of duplicates

By : Eric Graham
Date : March 29 2020, 07:55 AM
This might help you. Since there's no definitive answer yet, I'm taking the dataframe with all the rows (the 25 rows of results in the example above) and inner-joining/merging it with a dataframe that holds the list of known duplicate pairs (i.e. the 5 rows of output that I need). The resulting dataframe then contains the similarity scores for just the duplicate document pairs. This is a temporary solution; if anyone comes up with a cleaner one that works, I'll accept that as the answer.
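The inner-join described above can be sketched with pandas `merge`; the frame and column names below (`all_pairs`, `known_dupes`, `doc1`, `doc2`, `score`) are placeholders, not the asker's actual names:

```python
import pandas as pd

# All pairwise scores (stand-in for the full 25-row result)
all_pairs = pd.DataFrame({
    'doc1':  ['a', 'a', 'b', 'b'],
    'doc2':  ['b', 'c', 'c', 'd'],
    'score': [0.9, 0.2, 0.4, 0.8],
})

# Only the pairs known to be duplicates
known_dupes = pd.DataFrame({'doc1': ['a', 'b'], 'doc2': ['b', 'd']})

# Inner join keeps just the scores for the duplicate pairs
dupe_scores = known_dupes.merge(all_pairs, on=['doc1', 'doc2'], how='inner')
print(dupe_scores)
```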

## fastest way to perform cosine similarity for 10 million pairs of 1x20 vectors

By : user1483488
Date : March 29 2020, 07:55 AM
Hope that helps. This is the fastest way I have tried; it brought the calculation down from over 30 minutes in a loop to about 5 seconds:
code :
``````# Element-wise product of each row's pair of unit vectors
tempdf['vector_mult'] = np.multiply(tempdf['unit_vector'], tempdf['ave_unit_vector'])
# Summing the products gives the dot product, which for unit vectors is the cosine similarity
tempdf['cosinesim'] = tempdf['vector_mult'].apply(lambda x: sum(x))
``````
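The speed-up hinges on the vectors already being unit-normalised, so the elementwise-multiply-and-sum is exactly a dot product, i.e. the cosine similarity. A fully vectorised NumPy sketch of the same idea, on made-up data:

```python
import numpy as np

# 10 rows of paired 1x20 vectors (stand-in for the real data)
rng = np.random.default_rng(0)
a = rng.normal(size=(10, 20))
b = rng.normal(size=(10, 20))

# Normalise each row to unit length, then the row-wise dot product is the cosine similarity
a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
cosinesim = np.einsum('ij,ij->i', a_unit, b_unit)

print(cosinesim.shape)  # (10,)
```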