By : Asparatame
Date : November 22 2020, 02:42 PM

I hope this fixes the issue. You can make use of Python's groupby and combinations functions from itertools as follows: code :
from itertools import groupby, combinations
import math

def cosine_similarity(v1, v2):
    # Accumulate the dot product and the squared norms of both vectors
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy / math.sqrt(sumxx * sumyy)

info_list = [
    ('188.74.64.243', '1', [0, 1, 1, 0]),
    ('99.229.98.18', '1', [0, 1, 1, 1]),
    ('86.41.253.102', '1', [1, 1, 1, 1]),
    ('188.74.64.243', '2', [0, 1, 1, 0]),
    ('99.229.98.18', '2', [0, 1, 1, 1]),
    ('86.41.253.102', '2', [1, 1, 1, 1]),
]

# Group the records by their second field, then compare every pair within each group
for k, g in groupby(info_list, key=lambda x: x[1]):
    for x, y in combinations(g, 2):
        print((x[0], y[0], x[1], x[2], y[2], cosine_similarity(x[2], y[2])))
    print()
('188.74.64.243', '99.229.98.18', '1', [0, 1, 1, 0], [0, 1, 1, 1], 0.8164965809277261)
('188.74.64.243', '86.41.253.102', '1', [0, 1, 1, 0], [1, 1, 1, 1], 0.7071067811865475)
('99.229.98.18', '86.41.253.102', '1', [0, 1, 1, 1], [1, 1, 1, 1], 0.8660254037844387)
('188.74.64.243', '99.229.98.18', '2', [0, 1, 1, 0], [0, 1, 1, 1], 0.8164965809277261)
('188.74.64.243', '86.41.253.102', '2', [0, 1, 1, 0], [1, 1, 1, 1], 0.7071067811865475)
('99.229.98.18', '86.41.253.102', '2', [0, 1, 1, 1], [1, 1, 1, 1], 0.8660254037844387)
Note that groupby only groups consecutive items with equal keys, so if info_list is not already ordered by the grouping key, sort it first:

for k, g in groupby(sorted(info_list, key=lambda x: x[1]), key=lambda x: x[1]):
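To see why the sort matters, here is a minimal illustration (with made-up keys) of how groupby only merges runs of consecutive equal keys:

```python
from itertools import groupby

# Unsorted input: the key 'a' appears twice, separated by 'b'
unsorted_items = [('a', 1), ('b', 1), ('a', 2)]

# Without sorting, 'a' shows up as two separate groups
keys_unsorted = [k for k, _ in groupby(unsorted_items, key=lambda x: x[0])]
print(keys_unsorted)  # ['a', 'b', 'a']

# After sorting by the key, each distinct key yields exactly one group
keys_sorted = [k for k, _ in groupby(sorted(unsorted_items, key=lambda x: x[0]),
                                     key=lambda x: x[0])]
print(keys_sorted)  # ['a', 'b']
```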

Cosine similarity between consecutive pairs using whole articles in JSON file
By : Devon Dorrity
Date : March 29 2020, 07:55 AM
This may help you fix your problem. I think, based on our discussion above, you need to change the foo function and everything below it. See the code below. Note that I haven't actually run this, since I don't have your data and no sample lines were provided. code :
## Loading the packages needed:
import nltk, string
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

with open('SDM_2015.json') as f:
    data = [json.loads(line) for line in f]

## Defining our functions to filter the data
# Stemmer for reducing each word to its common root
stemmer = nltk.stem.porter.PorterStemmer()
# Translation table for removing punctuation
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

## First function that creates the tokens
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

## Function that, incorporating the first one, converts all words to lower case and removes punctuation (using the map specified above)
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

## tf-idf
# Note: fit_transform expects an iterable of strings, so extract the relevant
# text field from each JSON record as appropriate for your data
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf_data = vectorizer.fit_transform(data)

# cosine similarities
similarity_matrix = cosine_similarity(tfidf_data)

Calculate cosine similarity of all possible text pairs retrieved from 4 mysql tables
By : xincheng0125
Date : March 29 2020, 07:55 AM
This seems to work fine. The following is a minimal example that calculates the pairwise cosine similarities between a set of documents (assuming you have successfully retrieved the title and text from your database). code :
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assume that's the data we have (4 short documents)
data = [
    'I like beer and pizza',
    'I love pizza and pasta',
    'I prefer wine over beer',
    'Thou shalt not pass'
]

# Vectorise the data
# `X` will now be a TF-IDF representation of the data; the first row of `X`
# corresponds to the first sentence in `data`
vec = TfidfVectorizer()
X = vec.fit_transform(data)

# Calculate the pairwise cosine similarities
# (depending on the amount of data you have, this could take a while)
S = cosine_similarity(X)

'''
S looks as follows:

array([[ 1.        ,  0.4078538 ,  0.19297924,  0.        ],
       [ 0.4078538 ,  1.        ,  0.        ,  0.        ],
       [ 0.19297924,  0.        ,  1.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.        ]])

The first row of `S` contains the cosine similarities of the first sentence to every other element in `X`.
For example, the cosine similarity of the first sentence to the third sentence is ~0.193.
Obviously the similarity of every sentence/document to itself is 1 (hence the diagonal of the similarity matrix is all ones).
Given that all indices are consistent, it is straightforward to map the similarities back to the corresponding sentences.
'''
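As a sketch of that last point (reusing the `data` and `S` from above), the most similar distinct pair of documents can be recovered by masking the diagonal before taking the argmax:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

data = [
    'I like beer and pizza',
    'I love pizza and pasta',
    'I prefer wine over beer',
    'Thou shalt not pass'
]
S = cosine_similarity(TfidfVectorizer().fit_transform(data))

# Mask the diagonal (self-similarity is always 1) before taking the argmax
masked = S - np.eye(len(data))
i, j = np.unravel_index(np.argmax(masked), masked.shape)
print(data[i], '<->', data[j])  # the two most similar documents
```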

Python, Cosine Similarity to Adjusted Cosine Similarity
By : SL3
Date : March 29 2020, 07:55 AM
To fix this issue, here's a NumPy-based solution to your problem. First we store the rating data into an array: code :
import numpy as np
from scipy.spatial.distance import pdist, squareform

# `data` is assumed to be a pandas DataFrame with one column per fruit
fruits = np.asarray(['Apple', 'Orange', 'Pear', 'Grape', 'Melon'])
M = np.asarray(data.loc[:, fruits])

# Subtract each user's mean rating from their row
M_u = M.mean(axis=1)
item_mean_subtracted = M - M_u[:, None]

# Adjusted cosine similarity = cosine similarity on the mean-centred columns
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))

# For each fruit, rank the other fruits by descending similarity
# (dropping the last argsort column, which is the fruit itself)
indices = np.fliplr(np.argsort(similarity_matrix, axis=1)[:, :-1])
result = np.hstack((fruits[:, None], fruits[indices]))
In [49]: M
Out[49]:
array([[ 0, 10,  0,  1,  0],
       [ 6,  0,  0,  0,  2],
       [ 1,  0, 20,  0,  1],
       [ 0,  3,  6,  0, 18],
       [ 3,  0,  2,  0,  0],
       [ 0,  2,  0,  5,  0]])

In [50]: np.set_printoptions(precision=2)

In [51]: similarity_matrix
Out[51]:
array([[ 1.  ,  0.01,  0.41,  0.48,  0.44],
       [ 0.01,  1.  ,  0.57,  0.37,  0.26],
       [ 0.41,  0.57,  1.  ,  0.56,  0.19],
       [ 0.48,  0.37,  0.56,  1.  ,  0.51],
       [ 0.44,  0.26,  0.19,  0.51,  1.  ]])

In [52]: result
Out[52]:
array([['Apple', 'Grape', 'Orange', 'Pear', 'Melon'],
       ['Orange', 'Grape', 'Apple', 'Melon', 'Pear'],
       ['Pear', 'Melon', 'Apple', 'Grape', 'Orange'],
       ['Grape', 'Apple', 'Orange', 'Melon', 'Pear'],
       ['Melon', 'Pear', 'Orange', 'Apple', 'Grape']],
      dtype='S6')
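As a sanity check on the approach, here is a tiny made-up example (three users, two items) where the adjusted cosine from pdist can be verified directly against the definition after mean-centring each user's row:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up ratings: 3 users (rows) x 2 items (columns)
M = np.array([[1.0, 2.0],
              [3.0, 5.0],
              [4.0, 3.0]])

# Subtract each user's mean rating from their row
centred = M - M.mean(axis=1)[:, None]

# Adjusted cosine similarity between the two item columns via pdist
sim = 1 - squareform(pdist(centred.T, 'cosine'))

# The same value computed directly from the cosine definition
a, b = centred[:, 0], centred[:, 1]
direct = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(sim[0, 1], direct)  # the two values agree
```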

Cosine similarity for already known pairs of duplicates
By : Eric Graham
Date : March 29 2020, 07:55 AM
This might help you. Since there's no definitive answer yet, I'm taking the dataframe with all the rows (the 25 rows of results in the example above) and inner-joining/merging it with a dataframe that holds the list of duplicate pairs (i.e. the 5 rows of output that I need). That way, the resulting dataframe contains the similarity scores only for the duplicate document pairs. This is a temporary solution; if anyone comes up with a cleaner one that works, I'll accept it as the answer.
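That inner-join approach can be sketched as follows. This is a minimal sketch with made-up column names (id1, id2, score) standing in for the actual frames; adapt them to your data:

```python
import pandas as pd

# All-pairs similarity scores (made-up data standing in for the 25-row frame)
all_pairs = pd.DataFrame({
    'id1': ['a', 'a', 'b', 'b'],
    'id2': ['b', 'c', 'c', 'd'],
    'score': [0.9, 0.2, 0.8, 0.1],
})

# Known duplicate pairs (standing in for the 5-row frame)
duplicates = pd.DataFrame({'id1': ['a', 'b'], 'id2': ['b', 'c']})

# An inner merge keeps only the rows whose (id1, id2) appear in both frames,
# i.e. the similarity scores of the known duplicate pairs
result = pd.merge(all_pairs, duplicates, on=['id1', 'id2'], how='inner')
print(result)
```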

fastest way to perform cosine similarity for 10 million pairs of 1x20 vectors
By : user1483488
Date : March 29 2020, 07:55 AM
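One common approach for this scale (a sketch, not necessarily the accepted answer; it assumes the pairs are stacked into two (N, 20) arrays A and B) is to vectorise the row-wise cosine similarity in NumPy so no Python-level loop runs over the 10 million pairs:

```python
import numpy as np

def rowwise_cosine(A, B):
    # Cosine similarity of A[i] with B[i] for every i, fully vectorised:
    # einsum computes all row-wise dot products at once, and the norms
    # are likewise computed per row in a single call
    num = np.einsum('ij,ij->i', A, B)
    denom = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return num / denom

# Example with a small batch of random pairs (10 million works the same way)
rng = np.random.default_rng(0)
A = rng.random((1000, 20))
B = rng.random((1000, 20))
sims = rowwise_cosine(A, B)
```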


