The colors categorize each alteration as one of four types: typo, conventional variation, unconventional variation, or totally different.

library(stringdist)

Levenshtein: the minimum number of insertions, deletions, and substitutions needed to transform string a into string b.

## Levenshtein
stringdist("gato", "pato", method = "lv")
## [1] 1
stringdist("hola", "trola", method = "lv")
## [1] 2
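
The definition above maps onto the classic dynamic-programming table. The following is a minimal base R sketch, not the stringdist implementation; the function name `lev_dist` is made up for illustration, and it assumes single-byte or otherwise well-formed characters that `strsplit` can separate.

```r
# Hypothetical helper: Levenshtein distance via dynamic programming.
lev_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] = distance between the first i chars of a and first j chars of b
  d <- matrix(0L, n + 1, m + 1)
  d[, 1] <- 0:n   # delete every character of the prefix of a
  d[1, ] <- 0:m   # insert every character of the prefix of b
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,  # deletion
                             d[i + 1, j] + 1L,  # insertion
                             d[i, j] + cost)    # substitution or match
    }
  }
  d[n + 1, m + 1]
}

lev_dist("gato", "pato")   # 1
lev_dist("hola", "trola")  # 2
```

Base R also ships `utils::adist()`, which computes a generalized Levenshtein distance and can serve as a cross-check when stringdist is not installed.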

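Longest Common Substring distance: the minimum number of characters that must be removed from both strings until the remaining strings are identical. Despite the name, the stringdist documentation defines the common "substring" by pairing characters in order without requiring them to be contiguous, which is what is usually called the longest common subsequence.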

## LCS (characters that are not part of the LCS)
stringdist("gato", "pato", method = "lcs")
## [1] 2
stringdist("hola", "trola", method = "lcs")
## [1] 3
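
The counting rule above can be sketched in base R: find the length of the longest common subsequence, then count the characters of both strings that fall outside it. This is an illustrative sketch under the subsequence reading of the definition; `lcs_dist` is a hypothetical name, not a stringdist function.

```r
# Hypothetical helper: LCS distance = unpaired characters in both strings.
lcs_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # L[i + 1, j + 1] = LCS length of the first i chars of a and first j chars of b
  L <- matrix(0L, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      L[i + 1, j + 1] <- if (x[i] == y[j]) {
        L[i, j] + 1L                    # extend the common subsequence
      } else {
        max(L[i, j + 1], L[i + 1, j])   # drop one character from a or b
      }
    }
  }
  n + m - 2L * L[n + 1, m + 1]
}

lcs_dist("gato", "pato")   # 2: remove "g" and "p"
lcs_dist("hola", "trola")  # 3: remove "h", "t" and "r"
```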

Cosine: 1 minus the cosine similarity of the two N-gram count vectors.

# Cosine similarity of two numeric vectors
cos_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
}
# Single-character (q = 1) counts, one row per string
qg <- qgrams("gato", "pato", q = 1)
1 - cos_sim(qg[1, ], qg[2, ])
## [1] 0.25
## Cosine
stringdist("gato", "pato", method = "cosine")
## [1] 0.25
stringdist("hola", "trola", method = "cosine")
## [1] 0.329

Jaccard: 1 minus the number of shared N-grams divided by the number of distinct N-grams observed in either string.

qg <- qgrams("gato", "pato", q = 2)
qg
##    ga to pa at
## V1  1  1  0  1
## V2  0  1  1  1
## Jaccard
stringdist("gato", "pato", method = "jaccard", q = 2)
## [1] 0.5
# No q given, so the default q = 1 (single characters) applies
stringdist("hola", "trola", method = "jaccard")
## [1] 0.5
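
The quotient above can be reproduced with plain set operations. The sketch below is illustrative only: `qgram_set` and `jaccard_dist` are hypothetical names, and it assumes both strings have at least q characters (otherwise the union is empty and the result is NaN).

```r
# Hypothetical helper: the set of distinct q-grams in a string.
qgram_set <- function(s, q = 1) {
  n <- nchar(s)
  if (n < q) return(character(0))
  unique(substring(s, 1:(n - q + 1), q:n))
}

# Hypothetical helper: 1 - |shared q-grams| / |all observed q-grams|.
jaccard_dist <- function(a, b, q = 1) {
  A <- qgram_set(a, q)
  B <- qgram_set(b, q)
  1 - length(intersect(A, B)) / length(union(A, B))
}

jaccard_dist("gato", "pato", q = 2)  # 0.5: shares "at", "to" out of 4 bigrams
jaccard_dist("gato", "pato", q = 1)  # 0.4: shares 3 of 5 characters
jaccard_dist("hola", "trola")        # 0.5: shares 3 of 6 characters
```

The two results for "gato" and "pato" show how strongly the choice of q affects the distance.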

Jaro is based on the number m of common characters whose positions lie within half the length of the longer string, and on the number t of transpositions among them. Jaro-Winkler refines Jaro using a finding from empirical studies: fewer errors occur at the beginning of strings. The resulting distance is a function of five quantities determined by the two compared strings (A, B, m, t, l) and a scaling factor p chosen from [0, 0.25].

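Jaro-Winkler similarity: $\mathrm{sim}_{jw} = \mathrm{sim}_{j} + l\,p\,(1 - \mathrm{sim}_{j})$, where $\mathrm{sim}_{j}$ is the Jaro similarity and $l$ is the length of the common prefix at the start of the strings, up to a maximum of 4 characters. The corresponding distance is $d_{jw} = 1 - \mathrm{sim}_{jw} = d_{j}\,(1 - l\,p)$, with $d_{j} = 1 - \mathrm{sim}_{j}$ the Jaro distance.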

## Jaro-Winkler (p defaults to 0, so these are plain Jaro distances)
stringdist("gato", "pato", method = "jw")
## [1] 0.167
stringdist("hola", "trola", method = "jw")
## [1] 0.217
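
To make the roles of m, t, l and p concrete, here is a base R sketch of the textbook Jaro and Jaro-Winkler distances. It is not the stringdist source and may differ from it in edge cases. The names `jaro_dist` and `jw_dist` are hypothetical, the matching window `floor(max_len / 2) - 1` is the common textbook choice, and the default `p = 0.1` is an assumption made here (stringdist defaults to `p = 0`).

```r
# Hypothetical helper: textbook Jaro distance.
jaro_dist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  if (length(x) == 0 && length(y) == 0) return(0)
  window <- max(0, floor(max(length(x), length(y)) / 2) - 1)
  x_hit <- rep(FALSE, length(x))
  y_hit <- rep(FALSE, length(y))
  for (i in seq_along(x)) {
    lo <- max(1, i - window)
    hi <- min(length(y), i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!y_hit[j] && x[i] == y[j]) {  # first unused match inside the window
        x_hit[i] <- TRUE
        y_hit[j] <- TRUE
        break
      }
    }
  }
  m <- sum(x_hit)
  if (m == 0) return(1)
  # Matched characters that appear in a different order, counted in pairs
  n_trans <- sum(x[x_hit] != y[y_hit]) / 2
  1 - (m / length(x) + m / length(y) + (m - n_trans) / m) / 3
}

# Hypothetical helper: Winkler's prefix correction, d_jw = d_j * (1 - l * p).
jw_dist <- function(a, b, p = 0.1) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  k <- min(length(x), length(y), 4)
  same <- x[seq_len(k)] == y[seq_len(k)]
  l <- if (all(same)) k else which(!same)[1] - 1  # common prefix length, max 4
  jaro_dist(a, b) * (1 - l * p)
}

round(jaro_dist("gato", "pato"), 3)   # 0.167
round(jaro_dist("hola", "trola"), 3)  # 0.217
jw_dist("gato", "pato")               # same as Jaro: no common prefix
```

Because "gato" and "pato" differ in their first character, l is 0 and the prefix correction has no effect; it only lowers the distance for pairs such as "martha" and "marhta" that share a leading run of characters.
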
## Compare the methods
lapply(c("lv", "lcs", "cosine", "jaccard", "jw"),
       function(x) stringdist("gato", "pato", method = x))
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 0.25
##
## [[4]]
## [1] 0.4
##
## [[5]]
## [1] 0.167