com.swabunga.spell.engine
Class EditDistance
java.lang.Object
com.swabunga.spell.engine.EditDistance
- public class EditDistance
- extends java.lang.Object
This class is based on the Levenshtein distance algorithm and calculates how similar two words are.
If the words are identical, the distance is 0. The more the words have in common, the lower the distance value.
The distance value is based on how many operations it takes to get from one word to the other. The possible operations are
swapping characters, adding a character, deleting a character, and substituting a character.
The resulting distance is the sum of the costs of these operations. Each cost can be set in the Configuration object.
When there is more than one way to convert one word into the other, the lowest-cost distance is returned.
Another way to think about this: what is the cheapest sequence of operations that turns the "original" word into
the "similar" word? Each operation has a cost, and these costs are added up to get the distance.
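The weighted distance described above can be sketched as a small dynamic program. This is an illustration only, not the code of this class: the class name WeightedEditDistanceSketch, the four cost constants, and their values are hypothetical, and the sketch assumes that a swap means exchanging two adjacent characters. In the real class the costs come from the Configuration object.

```java
class WeightedEditDistanceSketch {
    // Hypothetical costs chosen for illustration only; the real class
    // reads its costs from the Configuration object.
    static final int COST_REMOVE = 3;
    static final int COST_INSERT = 2;
    static final int COST_SUBST = 4;
    static final int COST_SWAP = 1;

    static int distance(String word, String similar) {
        int n = word.length();
        int m = similar.length();
        // d[i][j] is the cheapest cost of turning the first i characters
        // of word into the first j characters of similar.
        int[][] d = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) d[i][0] = i * COST_REMOVE;
        for (int j = 1; j <= m; j++) d[0][j] = j * COST_INSERT;

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                char a = word.charAt(i - 1);
                char b = similar.charAt(j - 1);
                // Keep the character (free) or substitute it.
                int best = d[i - 1][j - 1] + (a == b ? 0 : COST_SUBST);
                // Delete a character from word.
                best = Math.min(best, d[i - 1][j] + COST_REMOVE);
                // Insert a character into word.
                best = Math.min(best, d[i][j - 1] + COST_INSERT);
                // Swap two adjacent characters (assumption of this sketch).
                if (i > 1 && j > 1
                        && a == similar.charAt(j - 2)
                        && b == word.charAt(i - 2)) {
                    best = Math.min(best, d[i - 2][j - 2] + COST_SWAP);
                }
                d[i][j] = best;
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("abc", "abc")); // 0: identical words
        System.out.println(distance("abc", "ab"));  // 3: one removal
        System.out.println(distance("ab", "abc"));  // 2: one insertion
        System.out.println(distance("teh", "the")); // 1: one swap
        System.out.println(distance("cat", "cot")); // 4: one substitution
    }
}
```

Because removal and insertion have different costs in this sketch, the distance from "abc" to "ab" differs from the distance from "ab" to "abc". The result therefore depends on which word is treated as the original.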
- See Also:
Configuration.COST_REMOVE_CHAR, Configuration.COST_INSERT_CHAR, Configuration.COST_SUBST_CHARS, Configuration.COST_SWAP_CHARS
Field Summary
static Configuration config
JMH Again, there is no need to have a global class matrix variable in this class.
Method Summary
static int getDistance(java.lang.String word, java.lang.String similar)
static void main(java.lang.String[] args)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
config
public static Configuration config
- JMH: Again, there is no need to have a global class matrix variable
in this class. I have removed it and made getDistance static final.
DMV: I refactored this method to make it more efficient, more readable, and simpler.
I also fixed a bug in how the distance was calculated. Comparing "abc" to "ab" could
give a wrong distance, depending on how COST_REMOVE_CHAR and EDIT_INSERTION_COST
were set up. That is now fixed.
WRS: I added a distance for case comparison, so that a misspelling of "i" is closer to "I" than
to "a".
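The case-comparison note can be illustrated with a substitution cost that is lower when two characters differ only by case. This is a sketch under assumptions, not the code of this class: the class name CaseAwareCostSketch, the constants COST_SUBST and COST_CHANGE_CASE, and their values are hypothetical.

```java
class CaseAwareCostSketch {
    // Hypothetical values for illustration only.
    static final int COST_SUBST = 100;
    static final int COST_CHANGE_CASE = 10;

    // Cost of replacing one character with another.
    static int substitutionCost(char from, char to) {
        if (from == to) {
            return 0; // same character, nothing to do
        }
        if (Character.toLowerCase(from) == Character.toLowerCase(to)) {
            return COST_CHANGE_CASE; // same letter, different case
        }
        return COST_SUBST; // different letters
    }

    public static void main(String[] args) {
        System.out.println(substitutionCost('i', 'I')); // 10
        System.out.println(substitutionCost('i', 'a')); // 100
    }
}
```

With these values, "i" is closer to "I" than to "a" because the case change costs less than a full substitution.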
EditDistance
public EditDistance()
getDistance
public static final int getDistance(java.lang.String word,
java.lang.String similar)
main
public static void main(java.lang.String[] args)
throws java.lang.Exception
- Throws:
java.lang.Exception