Spark Item Similarity Interpretation (Cross-Similarity and Similarity)
By : Hans Hansen
Date : March 29 2020, 07:55 AM
In both cases the matrix tells you that the item-id key is similar to the listed items, with the strength of each similarity given by the LLR value attached to that item. "Similar" here means that similar users purchased the items. In the second case it is saying that similar people viewed the items, and that this view also appears to have led to a purchase of the same item. Cooccurrence works for purchases alone; cross-occurrence adds a check that the view also correlated with a purchase. This allows you to use both for recommendations.
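To make the LLR value concrete, here is a minimal sketch in Java. It assumes the score is Dunning's log-likelihood ratio computed from a 2x2 contingency table of counts, in the entropy formulation; whether the Spark job uses exactly this form is an assumption, and the class and parameter names are hypothetical.

```java
// Sketch only: log-likelihood ratio (LLR) for a 2x2 contingency table.
// k11 = users who interacted with both item A and item B
// k12 = users who interacted with A but not B
// k21 = users who interacted with B but not A
// k22 = users who interacted with neither
class Llr {
    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a list of counts.
    private static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
            result += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - result;
    }

    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Guard against tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    public static void main(String[] args) {
        // Independent counts score near zero; strongly associated counts score high.
        System.out.println(logLikelihoodRatio(10, 10, 10, 10));
        System.out.println(logLikelihoodRatio(10, 0, 0, 10));
    }
}
```

For cross-occurrence, the same table would be built from two different actions: k11 would count users who viewed item B and purchased item A, and so on.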
|
word2vec_basic output: trying to test word similarity versus human similarity scores
By : pavan teja
Date : March 29 2020, 07:55 AM
To answer my own question: yes, the results are dismal, but that is because the model is too small and is trained on too little data. As simple as that. The implementation I experimented with uses a corpus of 17M words, runs for 100K steps, and takes just 2 adjacent words of context with an embedding size of 128. I got a larger Wikipedia sample with 124M words, increased the context to 24 words (12 on each side) and the embedding size to 256, and trained for 1.8M steps, and voila! The correlation (as measured in my question above) grew to 0.24. I then implemented subsampling of frequent words as described in this tutorial, and the correlation jumped further to 0.33. Finally, I left my laptop overnight to train with 36 words of context and 3.2M steps, and it got all the way to 0.42! I think we can call this success.
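The subsampling step can be sketched as follows. This assumes the scheme from the original word2vec paper, where each occurrence of a word with relative frequency f is kept with probability sqrt(t / f), capped at 1; the tutorial mentioned above may use a slightly different variant. The threshold value 1e-5 in the demo and all names are illustrative.

```java
import java.util.Random;

// Sketch only: subsampling of frequent words, assuming keep probability sqrt(t / f).
class Subsampler {
    // frequency: the word's share of all tokens in the corpus (between 0 and 1).
    // threshold: the cutoff t; words rarer than t are always kept.
    static double keepProbability(double frequency, double threshold) {
        if (frequency <= 0.0) {
            return 1.0;
        }
        return Math.min(1.0, Math.sqrt(threshold / frequency));
    }

    // Decide at random whether one occurrence of the word stays in the training stream.
    static boolean keep(double frequency, double threshold, Random rng) {
        return rng.nextDouble() < keepProbability(frequency, threshold);
    }

    public static void main(String[] args) {
        double t = 1e-5; // illustrative threshold
        System.out.println(keepProbability(0.01, t)); // a very frequent word is mostly dropped
        System.out.println(keepProbability(1e-6, t)); // a rare word is always kept
    }
}
```

Dropping most occurrences of very frequent words effectively widens the context window for the remaining words, which is one plausible reason for the jump in correlation reported above.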
|
cosine similarity(patient similarity metric) between 48k patients data with predictive variables
By : user6951129
Date : March 29 2020, 07:55 AM
First things first: you can find more rigorous treatments of cosine similarity in either of these posts: "Find cosine similarity between two arrays" and "Creating co-occurrence matrix". Code:
diasbp_rage <- diasbp_max - diasbp_min
library(lsa)
library(reshape2)
psm_sample <- read.csv("psm_sample.csv")
row.names(psm_sample) <-
make.names(paste0("patid.", as.character(psm_sample$subject_id)), unique = TRUE)
temp <- sapply(psm_sample, class)
temp <- cbind.data.frame(names(temp), as.character(temp))
names(temp) <- c("variable", "possible.type")
numeric.cols <- (temp$possible.type %in% c("factor", "integer") &
(!(grepl(
pattern = "_id$", x = temp$variable
))) &
(!(
grepl(pattern = "_code$", x = temp$variable)
)) &
(!(
grepl(pattern = "_type$", x = temp$variable)
))) | temp$possible.type == "numeric"
psm_numerics <- psm_sample[, numeric.cols]
row.names(psm_numerics) <- row.names(psm_sample)
psm_numerics$gender <- as.integer(psm_numerics$gender)
psm_scaled <- scale(psm_numerics)
pair.these.up <- psm_scaled
# checking for independence of variables
# if the following PDF pair plot is too big for your computer to open,
# try pair-plotting some random subset of columns
# keep.frac <- 0.5
# keep.flag <- runif(ncol(psm_scaled)) < keep.frac
# pair.these.up <- psm_scaled[, keep.flag]
# pdf device sizes are in inches
dev <-
pdf(
file = "psm_pairs.pdf",
width = 50,
height = 50,
paper = "special"
)
pairs(pair.these.up)
dev.off()
#transpose the dataframe to get the
#similarity between patients
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
extract.pat <- function(enc.col) {
my.patients <-
sapply(enc.col, function(one.pat) {
temp <- (strsplit(as.character(one.pat), ".", fixed = TRUE))
return(temp[[1]][[2]])
})
return(my.patients)
}
cs.melt$pat.A <- extract.pat(cs.melt$enc.A)
cs.melt$pat.B <- extract.pat(cs.melt$enc.B)
same.pat <- cs.melt[cs.melt$pat.A == cs.melt$pat.B ,]
different.pat <- cs.melt[cs.melt$pat.A != cs.melt$pat.B ,]
most.dissimilar <-
different.pat[which.min(different.pat$similarity),]
dissimilar.pat.frame <- rbind(psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.A) ,],
psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.B) ,])
print(t(dissimilar.pat.frame))
patid.68.49 patid.9
gender 1.00000 2.00000
age 41.85000 41.79000
sysbp_min 72.00000 106.00000
sysbp_max 95.00000 217.00000
diasbp_min 42.00000 53.00000
diasbp_max 61.00000 107.00000
meanbp_min 52.00000 67.00000
meanbp_max 72.00000 132.00000
resprate_min 20.00000 14.00000
resprate_max 35.00000 19.00000
tempc_min 36.00000 35.50000
tempc_max 37.55555 37.88889
spo2_min 90.00000 95.00000
spo2_max 100.00000 100.00000
bicarbonate_min 22.00000 26.00000
bicarbonate_max 22.00000 30.00000
creatinine_min 2.50000 1.20000
creatinine_max 2.50000 1.40000
glucose_min 82.00000 129.00000
glucose_max 82.00000 178.00000
hematocrit_min 28.10000 37.40000
hematocrit_max 28.10000 45.20000
potassium_min 5.50000 2.80000
potassium_max 5.50000 3.00000
sodium_min 138.00000 136.00000
sodium_max 138.00000 140.00000
bun_min 28.00000 16.00000
bun_max 28.00000 17.00000
wbc_min 2.50000 7.50000
wbc_max 2.50000 13.70000
mingcs 15.00000 15.00000
gcsmotor 6.00000 5.00000
gcsverbal 5.00000 0.00000
gcseyes 4.00000 1.00000
endotrachflag 0.00000 1.00000
urineoutput 1674.00000 887.00000
vasopressor 0.00000 0.00000
vent 0.00000 1.00000
los_hospital 19.09310 4.88130
los_icu 3.53680 5.32310
sofa 3.00000 5.00000
saps 17.00000 18.00000
posthospmort30day 1.00000 0.00000
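The core step of the R code above, `lsa::cosine(t(psm_scaled))`, computes the cosine of the angle between every pair of patient vectors. A minimal sketch of that computation in Java follows, assuming the rows have already been scaled as in the R code. It fills only the upper triangle and mirrors it, which avoids the duplicated work noted in the R comments. The class and method names are hypothetical.

```java
// Sketch only: pairwise cosine similarity between rows (one row per patient).
class CosineSimilarity {
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return Double.NaN; // cosine is undefined for a zero vector
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Computes each pair once and mirrors it, since the matrix is symmetric.
    static double[][] pairwise(double[][] rows) {
        int n = rows.length;
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(rows[i], rows[j]);
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }
}
```

With 48k patients the full matrix has about 2.3 billion entries, so in practice it may be necessary to process it in blocks or keep only the top matches per patient.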
|
Search the similarity of 2 strings in java using part of word matching, not cosine similarity
By : user3480692
Date : March 29 2020, 07:55 AM
For each search string, split it into words using haystack.split("\\s+") (\\s+ is regexp-ese for 'the words are separated by whitespace'). Then, to obtain a 'score', you need 2 numbers: how many words matched, and how many words there are in total. Sort descending on the first and ascending on the second, which gets you the behaviour you seem to want. Code:
String[] needle = "super cold white snow".split("\\s+");
String[] haystack = "white image superdupercold".split("\\s+");
int matchedWords = 0, totalWords = haystack.length;
for (String n : needle) {
boolean found = false;
for (String hay : haystack) {
if (hay.contains(n)) {
found = true;
break;
}
}
if (found) matchedWords++;
}
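// Option 1: pack both numbers into a single long; a higher score ranks first.
private static final long MULTIPLIER = 0x100000000L;
long score = MULTIPLIER * matchedWords + (Integer.MAX_VALUE - totalWords);
// Option 2: keep both numbers in a value class and sort with a comparator.
// @Value is Lombok's annotation; it generates getNeedle(), getWords(), getTotal().
@Value
class Result { String needle; int words, total; }
list.sort(
Comparator.comparing(Result::getWords).reversed()
.thenComparing(Result::getTotal));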
list.stream().map(Result::getNeedle).forEach(System.out::println);
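The fragments above can be combined into one self-contained sketch. It avoids Lombok by using a small hand-written value class, and it assumes the goal is to rank a list of candidate strings against one query; the class, method, and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch only: rank candidates by how many query words appear inside their words.
class PartialWordRanker {
    static final class Scored {
        final String candidate;
        final int matched; // query words found as a substring of some candidate word
        final int total;   // number of words in the candidate

        Scored(String candidate, int matched, int total) {
            this.candidate = candidate;
            this.matched = matched;
            this.total = total;
        }
    }

    static Scored score(String query, String candidate) {
        String[] needles = query.trim().split("\\s+");
        String[] hay = candidate.trim().split("\\s+");
        int matched = 0;
        for (String n : needles) {
            for (String h : hay) {
                if (h.contains(n)) { // partial match: "superdupercold" contains "cold"
                    matched++;
                    break;
                }
            }
        }
        return new Scored(candidate, matched, hay.length);
    }

    // Most matched words first; ties are broken by fewer total words.
    static List<String> rank(String query, List<String> candidates) {
        List<Scored> scored = new ArrayList<>();
        for (String c : candidates) {
            scored.add(score(query, c));
        }
        scored.sort(Comparator.comparingInt((Scored s) -> s.matched)
                .reversed()
                .thenComparingInt(s -> s.total));
        List<String> ranked = new ArrayList<>();
        for (Scored s : scored) {
            ranked.add(s.candidate);
        }
        return ranked;
    }
}
```

For the query "super cold white snow", the candidate "white image superdupercold" matches three words (super, cold, white) and so ranks ahead of "white snow", which matches two.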
|
How to estimate 2D similarity transformation (linear conformal, nonreflective similarity) in OpenCV?
By : vaibhav
Date : March 29 2020, 07:55 AM
|