Python cosine-similarity on all possible pairs in list


By : Asparatame
Date : November 22 2020, 02:42 PM
You can use the groupby and combinations functions from Python's itertools module as follows:
code :
from itertools import groupby, combinations
import math

def cosine_similarity(v1,v2):
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx * sumyy)

info_list = [
    ('188.74.64.243', '1', [0, 1, 1, 0]),
    ('99.229.98.18',  '1', [0, 1, 1, 1]),
    ('86.41.253.102', '1', [1, 1, 1, 1]),
    ('188.74.64.243', '2', [0, 1, 1, 0]),
    ('99.229.98.18',  '2', [0, 1, 1, 1]),
    ('86.41.253.102', '2', [1, 1, 1, 1]),
    ]

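# groupby only merges consecutive items, so info_list must already be ordered by group (x[1])
for k, g in groupby(info_list, key=lambda x: x[1]):
    for x, y in combinations(g, 2):
        # Print each pair as a tuple (the extra parentheses keep the tuple form in Python 3)
        print((x[0], y[0], x[1], x[2], y[2], cosine_similarity(x[2], y[2])))
    print()  # blank line between groups
Output: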
('188.74.64.243', '99.229.98.18', '1', [0, 1, 1, 0], [0, 1, 1, 1], 0.8164965809277261)
('188.74.64.243', '86.41.253.102', '1', [0, 1, 1, 0], [1, 1, 1, 1], 0.7071067811865475)
('99.229.98.18', '86.41.253.102', '1', [0, 1, 1, 1], [1, 1, 1, 1], 0.8660254037844387)

('188.74.64.243', '99.229.98.18', '2', [0, 1, 1, 0], [0, 1, 1, 1], 0.8164965809277261)
('188.74.64.243', '86.41.253.102', '2', [0, 1, 1, 0], [1, 1, 1, 1], 0.7071067811865475)
('99.229.98.18', '86.41.253.102', '2', [0, 1, 1, 1], [1, 1, 1, 1], 0.8660254037844387)
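If info_list is not already sorted by the grouping key, sort it first, because groupby only groups consecutive items:
for k, g in groupby(sorted(info_list, key=lambda x: x[1]), key=lambda x: x[1]):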
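As a minimal sketch of why the input order matters to groupby, the snippet below groups a small list whose group keys are interleaved, once as-is and once after sorting. The records and the helper group_sizes are made up for illustration and are not part of the answer above.

```python
from itertools import groupby

# Hypothetical records: (name, group) with the groups interleaved.
records = [("a", "1"), ("b", "2"), ("c", "1"), ("d", "2")]

def group_sizes(items):
    """Return [(key, size), ...] in the order groupby yields the groups."""
    return [(k, len(list(g))) for k, g in groupby(items, key=lambda r: r[1])]

unsorted_groups = group_sizes(records)
sorted_groups = group_sizes(sorted(records, key=lambda r: r[1]))

print(unsorted_groups)  # [('1', 1), ('2', 1), ('1', 1), ('2', 1)]
print(sorted_groups)    # [('1', 2), ('2', 2)]
```

Without the sort, each group is split into fragments, so combinations would never pair items that belong to the same group but are not adjacent.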


cosine-similarity between consecutive pairs using whole articles in JSON file

By : Devon Dorrity
Date : March 29 2020, 07:55 AM
Based on our discussion above, I think you need to change the foo function and everything below it. See the code below. Note that I haven't actually run this, since I don't have your data and no sample lines were provided.
code :
## Load the packages needed
import json
import string

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Read one JSON document per line
with open('SDM_2015.json') as f:
    data = [json.loads(line) for line in f]

## Define the functions used to filter the data

# Stemmer that reduces each word to its common root
stemmer = nltk.stem.porter.PorterStemmer()

# Translation table that removes punctuation
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

## Stem every token
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

## Lowercase the text, strip punctuation (using the map above), tokenize, then stem
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

## TF-IDF
# Note: fit_transform expects an iterable of strings; if each JSON line is an object, pass its article text instead
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf_data = vectorizer.fit_transform(data)

## Cosine similarities between all pairs of documents
similarity_matrix = cosine_similarity(tfidf_data)
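Since the question asks about consecutive pairs, the entries of interest in a pairwise similarity matrix are the ones just above the main diagonal. A minimal sketch, using a small hand-written matrix S as a stand-in for the real result (its values are made up for illustration):

```python
# Stand-in for a pairwise similarity matrix; the values are illustrative only.
S = [
    [1.0, 0.4, 0.2, 0.0],
    [0.4, 1.0, 0.1, 0.3],
    [0.2, 0.1, 1.0, 0.5],
    [0.0, 0.3, 0.5, 1.0],
]

# S[i][i + 1] is the similarity between article i and article i + 1.
consecutive = [S[i][i + 1] for i in range(len(S) - 1)]
print(consecutive)  # [0.4, 0.1, 0.5]
```

If the matrix is a NumPy array, np.diag(S, k=1) should return the same values.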
Calculate cosine similarity of all possible text pairs retrieved from 4 mysql tables

By : xincheng0125
Date : March 29 2020, 07:55 AM
The following is a minimal example that calculates the pairwise cosine similarities between a set of documents (assuming you have successfully retrieved the title and text from your database).
code :
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assume that's the data we have (4 short documents)
data = [
    'I like beer and pizza',
    'I love pizza and pasta',
    'I prefer wine over beer',
    'Thou shalt not pass'
]

# Vectorise the data
vec = TfidfVectorizer()
X = vec.fit_transform(data) # `X` will now be a TF-IDF representation of the data, the first row of `X` corresponds to the first sentence in `data`

# Calculate the pairwise cosine similarities (depending on the amount of data that you are going to have this could take a while)
S = cosine_similarity(X)

'''
S looks as follows:
array([[ 1.        ,  0.4078538 ,  0.19297924,  0.        ],
       [ 0.4078538 ,  1.        ,  0.        ,  0.        ],
       [ 0.19297924,  0.        ,  1.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  1.        ]])

The first row of `S` contains the cosine similarities to every other element in `X`. 
For example the cosine similarity of the first sentence to the third sentence is ~0.193. 
Obviously the similarity of every sentence/document to itself is 1 (hence the diagonal of the sim matrix will be all ones). 
Given that all indices are consistent it is straightforward to extract the corresponding sentences to the similarities.
'''
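To illustrate the last point, here is a rough sketch that maps the matrix indices back to sentence pairs and ranks them by similarity. It hard-codes the values of S shown above as a plain list of lists so that it runs without scikit-learn; the name pairs is introduced here for illustration.

```python
from itertools import combinations

data = [
    'I like beer and pizza',
    'I love pizza and pasta',
    'I prefer wine over beer',
    'Thou shalt not pass',
]

# Values copied from the matrix S printed above.
S = [
    [1.0, 0.4078538, 0.19297924, 0.0],
    [0.4078538, 1.0, 0.0, 0.0],
    [0.19297924, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# S is symmetric, so each unordered pair (i < j) is visited once.
pairs = [(S[i][j], data[i], data[j]) for i, j in combinations(range(len(data)), 2)]
pairs.sort(key=lambda p: p[0], reverse=True)

for score, first, second in pairs:
    print(f"{score:.3f}  {first!r}  {second!r}")
```

The first line printed is the most similar pair, here the two sentences that share "pizza and".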
Python, Cosine Similarity to Adjusted Cosine Similarity

By : SL3
Date : March 29 2020, 07:55 AM
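Here's a NumPy-based solution to your problem.
First we store the rating data in an array:
code :
import numpy as np
from scipy.spatial.distance import pdist, squareform

# data is the ratings DataFrame from the question
fruits = np.asarray(['Apple', 'Orange', 'Pear', 'Grape', 'Melon'])
M = np.asarray(data.loc[:, fruits])
M_u = M.mean(axis=1)
# Subtract each row's mean rating from that row
item_mean_subtracted = M - M_u[:, None]
# pdist gives cosine distances between the columns; 1 - distance is the similarity
similarity_matrix = 1 - squareform(pdist(item_mean_subtracted.T, 'cosine'))
# For each fruit, the other fruits ordered from most to least similar
indices = np.fliplr(np.argsort(similarity_matrix, axis=1)[:,:-1])
result = np.hstack((fruits[:, None], fruits[indices]))
Sample run: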
In [49]: M
Out[49]: 
array([[ 0, 10,  0,  1,  0],
       [ 6,  0,  0,  0,  2],
       [ 1,  0, 20,  0,  1],
       [ 0,  3,  6,  0, 18],
       [ 3,  0,  2,  0,  0],
       [ 0,  2,  0,  5,  0]])

In [50]: np.set_printoptions(precision=2)

In [51]: similarity_matrix
Out[51]: 
array([[ 1.  ,  0.01, -0.41,  0.48, -0.44],
       [ 0.01,  1.  , -0.57,  0.37, -0.26],
       [-0.41, -0.57,  1.  , -0.56, -0.19],
       [ 0.48,  0.37, -0.56,  1.  , -0.51],
       [-0.44, -0.26, -0.19, -0.51,  1.  ]])

In [52]: result
Out[52]: 
array([['Apple', 'Grape', 'Orange', 'Pear', 'Melon'],
       ['Orange', 'Grape', 'Apple', 'Melon', 'Pear'],
       ['Pear', 'Melon', 'Apple', 'Grape', 'Orange'],
       ['Grape', 'Apple', 'Orange', 'Melon', 'Pear'],
       ['Melon', 'Pear', 'Orange', 'Apple', 'Grape']], 
      dtype='|S6')
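As a sanity check on the numbers in the sample run, the adjusted cosine similarity can also be computed by hand for a single pair of columns. The sketch below uses only the standard library and the matrix M printed above; the helper adjusted_cosine is a name introduced here, not part of the original answer.

```python
import math

# Rating matrix M from the sample run (columns: Apple, Orange, Pear, Grape, Melon).
M = [
    [0, 10, 0, 1, 0],
    [6, 0, 0, 0, 2],
    [1, 0, 20, 0, 1],
    [0, 3, 6, 0, 18],
    [3, 0, 2, 0, 0],
    [0, 2, 0, 5, 0],
]

def adjusted_cosine(M, i, j):
    """Cosine similarity of columns i and j after subtracting each row's mean."""
    xs, ys = [], []
    for row in M:
        mean = sum(row) / len(row)
        xs.append(row[i] - mean)
        ys.append(row[j] - mean)
    dot = sum(x * y for x, y in zip(xs, ys))
    norm = math.sqrt(sum(x * x for x in xs) * sum(y * y for y in ys))
    return dot / norm

apple_orange = adjusted_cosine(M, 0, 1)
apple_grape = adjusted_cosine(M, 0, 3)
print(round(apple_orange, 2), round(apple_grape, 2))  # 0.01 0.48
```

These agree with similarity_matrix[0, 1] and similarity_matrix[0, 3] in the sample run.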
Cosine similarity for already known pairs of duplicates

By : Eric Graham
Date : March 29 2020, 07:55 AM
Since there's no definitive answer yet, I'm taking the dataframe with all the rows (the 25 rows of results in the example above) and inner-joining (merging) it with a dataframe that holds the list of duplicate pairs (i.e. the 5 rows of output that I need). That way, the resulting dataframe has the similarity scores for just the duplicate document pairs. This is a temporary solution. If anyone can come up with a cleaner solution that works, I'll accept it as the answer.
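The same inner-join idea can be sketched without pandas as a lookup of each known duplicate pair in the all-pairs scores. The document ids and scores below are hypothetical; with pandas, the corresponding step would be a DataFrame.merge with how="inner" on the two id columns.

```python
# Hypothetical all-pairs similarity scores keyed by (doc_a, doc_b).
all_scores = {
    ("d1", "d1"): 1.0, ("d1", "d2"): 0.83, ("d1", "d3"): 0.10,
    ("d2", "d1"): 0.83, ("d2", "d2"): 1.0, ("d2", "d3"): 0.42,
    ("d3", "d1"): 0.10, ("d3", "d2"): 0.42, ("d3", "d3"): 1.0,
}

# Hypothetical list of pairs already known to be duplicates.
known_duplicates = [("d1", "d2"), ("d2", "d3")]

# Inner join: keep a pair only if it appears in both inputs.
duplicate_scores = [
    (a, b, all_scores[(a, b)])
    for a, b in known_duplicates
    if (a, b) in all_scores
]
print(duplicate_scores)  # [('d1', 'd2', 0.83), ('d2', 'd3', 0.42)]
```

This still computes every pair first and filters afterwards, so it shares the cost of the temporary solution described above.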
fastest way to perform cosine similarity for 10 million pairs of 1x20 vectors

By : user1483488
Date : March 29 2020, 07:55 AM
This is the fastest way I have tried. It brought the calculation down from over 30 minutes in a loop to about 5 seconds:
code :
tempdf['vector_mult'] = np.multiply(tempdf['unit_vector'], tempdf['ave_unit_vector'])
tempdf['cosinesim'] = tempdf['vector_mult'].apply(lambda x: sum(x))
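A possible variation, assuming the vectors can be stacked into two 2-D NumPy arrays with one pair per row: the row-wise dot products can then be computed in a single vectorized call instead of a per-row apply. The arrays A and B and the random data are hypothetical stand-ins, and no timing is claimed for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))  # hypothetical data: 1000 pairs of length-20 vectors
B = rng.normal(size=(1000, 20))

# Row-wise dot products: dots[i] = sum over j of A[i, j] * B[i, j].
dots = np.einsum("ij,ij->i", A, B)

# Divide by the product of the row norms to get the cosine similarity per pair.
cosine_sim = dots / (np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1))
print(cosine_sim.shape)  # (1000,)
```

If the rows are already unit vectors, as the column names above suggest, the division is unnecessary and dots is the cosine similarity directly.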