Update November 2014: Sift4 is here!! Check out the new improved version of Sift here: Super Fast and Accurate string distance algorithm: Sift4

Update October 6 2014: New stuff, compare Levenstein vs Sift here:

Algorithm:      Levenstein      Sift
String 1:   String 2:  

Result:



Update June 25th 2013: I've decided to play a little with the suggestions in the comments and check for validity. This was spurned by the realization that a lot of people use my algorithm. So, in order to celebrate this, here is the "3B" version of the Sift3 algorithm:
It is made in Javascript, this time, as it was easier to test and has the following extra features:

  • a maxDistance value that tells the algorithm to stop if the strings are already too different.
  • two pointers c1 and c2, rather than a single pointer c and two offsets
  • Instead of dividing to 2 the total length of the strings compared, now I divide it with 1.5. Why? Because this way the value is closer to the Levenshtein distance computed per random strings


Happy usage!
The variant I posted was totally buggy. I removed it. Just use sift3Distance.


Update: the Sift algorithm is now on Github.

A while ago I wrote an entry here about Sift2, an improvement of Sift, the original and silly string distance algorithm. Now I am publishing Sift3, which is way more accurate and even simpler as an algorithm.

I found out that my algorithm is part of a class of algorithms that solve the Longest Common Substring problem, therefore I calculated the LCS, not the distance, then the distance from the LCS. The result is way more robust, easy to understand and closer to the Levenshtein algorithm both on random strings and user databases. Not to mention that there is no goto in this one.

BTW, if you are looking for an algorithm that detects switched words, this is not it :) This just looks for typos and small regional differences between the strings. I mean, you could normalize the strings, so that words are ordered by some mechanism, then it would work because the words wouldn't be switched :)

I promise to work on a word switching algorithm, but not in the near future.
Without further ado, here is the code:

The C# code is a method in an object that has a private member maxOffset. As in Sift2 maxOffset should be around 5 and it represents the range in which to try to find a missing character.

public float Distance(string s1, string s2, int maxOffset)
{
if (String.IsNullOrEmpty(s1))
{
return String.IsNullOrEmpty(s2)
? 0
: s2.Length;
}
if (String.IsNullOrEmpty(s2))
{
return s1.Length;
}
int c = 0;
int offset1 = 0;
int offset2 = 0;
int lcs = 0;
while ((c + offset1 < s1.Length)
&&
(c + offset2 < s2.Length))
{
if (s1[c + offset1] ==
s2[c + offset2])
lcs++;
else
{
offset1 = 0;
offset2 = 0;
for (int i = 0;
i < maxOffset;
i++)
{
if ((c + i < s1.Length)
&&
(s1[c + i] == s2[c]))
{
offset1 = i;
break;
}
if ((c + i < s2.Length)
&&
(s1[c] == s2[c + i]))
{
offset2 = i;
break;
}
}
}
c++;
}
return (s1.Length + s2.Length)/2 - lcs;
}



And here is the T-Sql code. This version is actually an improvement of my original source, gracefully provided by Todd Wolf:

CREATE FUNCTION [DBO].[Sift3distance2]
(
@s1 NVARCHAR(3999),@s2 NVARCHAR(3999),@maxOffset INT
)
RETURNS FLOAT
AS
BEGIN
DECLARE @s1LEN INT,@s2LEN INT

SELECT @s1LEN=Len(Isnull(@s1,'')),@s2LEN=Len(Isnull(@s2,''))

IF @s1LEN=0 RETURN @s2LEN
ELSE
IF @s2LEN=0 RETURN @s1LEN

IF Isnull(@maxOffset,0)=0 SET @maxOffset=5

DECLARE @currPos INT,@matchCnt INT,@wrkPos INT,@s1Offset INT,@s1Char VARCHAR,@s1Pos INT,@s1Dist INT,@s2Offset INT,@s2Char VARCHAR,@s2Pos INT,@s2Dist INT

SELECT @s1Offset=0,@s2Offset=0,@matchCnt=0,@currPos=0

WHILE(@currPos+@s1Offset<@s1LEN AND @currPos+@s2Offset<@s2LEN)
BEGIN
SET @wrkPos=@currPos+1

IF(Substring(@s1,@wrkPos+@s1Offset,1)=Substring(@s2,@wrkPos+@s2Offset,1)) SET @matchCnt=@matchCnt+1
ELSE
BEGIN
SET @s1Offset=0

SET @s2Offset=0

SELECT @s1Char=Substring(@s1,@wrkPos,1),@s2Char=Substring(@s2,@wrkPos,1)

SELECT @s1Pos=Charindex(@s2Char,@s1,@wrkPos)-1,@s2Pos=Charindex(@s1Char,@s2,@wrkPos)-1

SELECT @s1Dist=@s1Pos-@currPos,@s2Dist=@s2Pos-@currPos

IF(@s1Pos>0 AND (@s1Dist<=@s2Dist OR @s2Pos<1) AND @s1Dist<@maxOffset) SET @s1Offset=(@s1Pos-@wrkPos)+1
ELSE
IF(@s2Pos>0 AND (@s2Dist<@s1Dist OR @s1Pos<1) AND @s2Dist<@maxOffset) SET @s2Offset=(@s2Pos-@wrkPos)+1
END

SET @currPos=@currPos+1
END

RETURN(@s1LEN+@s2LEN)/2.0-@matchCnt
END


It doesn't give the same exact results as my own code, yet the result is close enough and the speed is about 20% higher.

And thanks to Diogo Nechtan, the version in PHP:

function sift3Plus($s1, $s2, $maxOffset) {
$s1Length = strlen($s1);
$s2Length = strlen($s2);
if (empty($s1)) {
return (empty($s2) ? 0 : $s2Length);
}
if (empty($s2)) {
return $s1Length;
}
$c1 = $c2 = $lcs = 0;

while (($c1 < $s1Length) && ($c2 < $s2Length)) {
if (($d = $s1{$c1}) == $s2{$c2}) {
$lcs++;
} else {
for ($i = 1; $i < $maxOffset; $i++) {
if (($c1 + $i < $s1Length) && (($d = $s1{$c1 + $i}) == $s2{$c2})) {
$c1 += $i;
break;
}
if (($c2 + $i < $s2Length) && (($d = $s1{$c1}) == $s2{$c2 + $i})) {
$c2 += $i;
break;
}
}
}
$c1++;
$c2++;
}
return (($s1Length + $s2Length) / 2 - $lcs);
}


And thanks to Fernando Jorge Mota, the version in Python:



Also, here is the Javascript version, used in Mailcheck, by Derrick Ko and Wei Lu.

function sift3Distance(s1, s2) {
if (s1 == null || s1.length === 0) {
if (s2 == null || s2.length === 0) {
return 0;
} else {
return s2.length;
}
}

if (s2 == null || s2.length === 0) {
return s1.length;
}

var c = 0;
var offset1 = 0;
var offset2 = 0;
var lcs = 0;
var maxOffset = 5;

while ((c + offset1 < s1.length) && (c + offset2 < s2.length)) {
if (s1.charAt(c + offset1) == s2.charAt(c + offset2)) {
lcs++;
} else {
offset1 = 0;
offset2 = 0;
for (var i = 0; i < maxOffset; i++) {
if ((c + i < s1.length) && (s1.charAt(c + i) == s2.charAt(c))) {
offset1 = i;
break;
}
if ((c + i < s2.length) && (s1.charAt(c) == s2.charAt(c + i))) {
offset2 = i;
break;
}
}
}
c++;
}
return (s1.length + s2.length) / 2 - lcs;
}


Another implementation, this time in Java, by Eclesia:


You might also be interested in a customised version in AutoKey, by Toralf:



Thanks all for your comments and I look forward to more. Just tell me it worked or not and, most important, why. Good luck!

Update 2020 - most of the links here are dead, the things they referred to long forgotten. So much for "once you put it on the Internet it never disappears".

Having reached the 200th entry, I really wanted to write something cool, something interesting, something that sticks (and it ain't shit).

I thought of blogging Kartoo, a very nice - albeit slow - visual search engine that shows not only relevant links, but also the context items that link different pages.

But Kartoo is not personal enough, so I switched to YouTube, thought about blogging (yet another) female vocalist nu-metal with goth overtones band like the Italian band Exilia. Or something else, like the Turkish band maNga, or the Spanish Dead Stoned or Demiurgo. But this is a blog, not a video/music site.

Then I thought about programming; there must be something in the three projects I am working on worth blogging about, or at least something important like Don't use the .NET Random class when concerned about security. But then again, the blog is full of (I hope) interesting programming hints.

What else is there? Ranting about bycicle lanes the city hall is building on the sidewalk and on which old people are happy to walk (slowly) without losing themselves;
interesting conceptual games like BoomShine, Straight Dice or Stickman Fight and how they can be improved;
the BBC Baghdad Navigator, to show you the distribution and timeline of Baghdad bombings;
the Lilium song for the anime Elfen Lied;
the Coma article on Wikipedia (I didn't write it);
coming improvements in the Sift3 algorithm;
InuYasha manga reaching chapter 500;
the new Google/Kartoo/Wikipedia searches for any selected text in the blog;
how I am reading Il Nome de la Rosa and The Name of the Rose in the same time, trying to grasp more of the Italian language;
Gwoemul, a very nice South Korean film...

No, there is too much to choose and I can't decide. I think I will skip entry 200 entirely.

and has 1 comment

Reading this article on digg, I began searching the web for this very cool religion called Pastafarianism and I feel that it relates to me in a very spiritual way. In other words, it makes me laugh my ass off!

As you can read in the Wikipedia article, the Flying Spaghetti Monster created the world (or is it the other way around?) in order to prove to idiots that you either think or you believe. There is no middle ground. Thinking requires trusting your observations, emitting theories and then validating them by using observed data. Believing doesn't require anything, therefore being easier to do, and can (and most of the time will) deny your ability to observe, your capacity to reason or to grasp reality and look down on your desire to understand anything that is believed.

Well, seriously now. One cannot believe the world was created by the Spaghetti Monster... Or maybe one can, as long as they accept the obvious fact that the Spaghetti Monster was created by the Invisible Pink Unicorn.

You sometimes need to copy the exact structure of a database to an Sql2000 server, even if the source server is 2005.

Follow these steps:

  • open 2005 Sql Server Management Studio

  • right click on the offending database and go to Tasks -> Generate Scripts

  • do NOT check Script all objects in the selected database

  • click Next

  • set Include if NOT EXISTS to False

  • set Script for Server Version to SQL Server 2000

  • try to check only the objects and types of objects you actually need

  • create the script

  • delete all occurences of "WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)" from the generated script



Now the script should work on an SQL 2000 Server..
For the copying of data, the Server Management Studio has an option called Copy DatabaseExport Data, also in Tasks, that now accepts an Sql 2000 Server as destination.

and has 2 comments
In the Romanian jargon there is an insult: slave. It appeared ten years or so ago and it stuck. It probably came from Gypsy talk, probably holding more meaning to them, since they were liberated from slavery and into the worst social problem Romania has. But at least the cool ones are not slaves anymore, even if they recreate daily the hated cliche of the typical Gypsy Roma and are stuck in the mentality that work is somehow a shameful act.

But the word is also used by Romanians. You might see young people that have a little business or a scheme to get money quickly use it to refer to the people that are employed somewhere and go to work every day. And they are somehow right, since a lot of the rich people in Romania, business owners, top managers, land owners, etc. are uneducated folks. Instead of going to school, they chose to fight, risk, learn in the school of life. And it shows. They have a lot of money and no manners. They have a lot of opportunities, but don't really use them. They are no longer rude people with no money, they have money now, but are stuck into being the same people they started with. Kind of like the joke where the anus wanted to be the manager of the body.

How about the slave, then? The guy that goes to school, goes through all the (mechanical) motions of learning, passing exams, getting a job, living a "normal" life? Well, we are mostly wasting our time. The money we get are what we need to live, maybe even enough to get a car or, if one is lucky, an apartment. People like us spend their entire lives surviving and dreaming about what we would do if we had more money. Meanwhile, we lose 8 hours a day working, 3 preparing and going to work, 7 sleeping, 1 or 2 eating and are left with 4-5 hours in which to do "what we like". If that isn't slavery, I don't know what is, but in all that time we make the system work.

Slimy manager types that themselves work for uneducated bullies that somehow got into fortune work in the system as well. Poor 'unslavy' thieves occasionally steal something, thus making policemen work all day and accept bribes. Businesses run into the political framework maintained by corrupt politicians who themselves obey the laws of economics, which are heavily influenced by the people from other countries, who themselves are only cogs in this great machine we call humanity. And of course, all those good managers that work for great people and all the Gypsies that go to school and work and all those politicians who actually want to make a difference, too.

This is no freedom, we are all slaves. We allow ourselves to be blinded by a value system, be it invented by us or just stolen from someone else, and we live by it. We all choose our swords and then we let ourselves die by them. True freedom is inside, not outside, it's in the dreams, not in their realisation. I might even venture on saying that it's in the quantity of the dreams, not their quality, because that is what enslaves us, dreaming of a single thing, being desperate to achieve it and then to hold on to it.

and has 2 comments
I need your input, readers dear! I've changed the blog so that when you double click on a word a google window appears that you can expand, close or scroll at your leaisure. Do you like it? Would you like to dblclick and search this blog instead? Or maybe digg or something? Do you hate it? You want it removed? Does it hinder you in any way? Thanks.

and has 0 comments
I've found this interesting article by John Cronan about using the Abstract Factory pattern to access files, no matter if they are on FTP, HTTP or the local or networked file system. Basically he uses WebRequest.Create rather than any *Stream* class.

Interesting enough, he seems to be the only one providing a solution to the problem of accessing local file system resources when the default access rights do not allow you to, even if the logged on credentials would normally give you the access, thus solving an issue of the FileWebRequest class. Unfortunately he uses P/Invoke, which kind of contradicts the whole "more flexible than thou" approach of the article.

Overall an interesting read which gives you flexibility of file support, while taking away some of the specific advantages like seeking or appending. It's a definitely better approach than StreamReader and the ugly "URI formats are not supported." error.

A bonus for using this method is that it is compatible with the Office 2007/Vista Open Packaging addressing model, by way of the PackWebRequest class.

and has 1 comment

I've just finished watching "Free Energy - the Race to Zero Point", which is a documentary of sorts listing ideas of ways to produce free energy with open systems, or getting a lot more efficiency than present systems. The speakers are authors of controversial books and editors at magazines names as crackpotty as possible. The narrator himself looks like a Hitchcock wannabe, presenting the end of the world. Heck, the film is not even listed on Imdb, therefore this blog entry.
But, even if I am mostly convinced that this is a piece of sensationalist propaganda and not true science, I am left wondering how much (if any) of this is truly real? Did Moray have a device that lit up light bulbs without fuel or batteries? Are the numerous inventors presented there just crackpots or do they have something? I find it difficult to believe that all video proof that was presented in the movies was faked. Why would they?
Yet most of all I resonated with the idea that is, unfortunately for this movie, presented by all featured people: economic interests reign supreme and devices that don't need to be connected to power grids, use oil or that can be regulated by established industries are not only avoided, but actively attacked. It does make sense, doesn't it?

So, without further ado, here are some start up links from Wikipedia to help you make your own mind:
Zero-point energy
The Casimir effect
The Hutchison effect
Thomas Henry Moray
Cold fusion
Electrostatic levitation

and has 0 comments

Well, I don't. I was first shocked to find out that the 1918 "Spanish" Flu pandemic killed 50 million people and I found out about it only in my twenties. Now I see that the pandemics are recurring events, there are lists with the virus strains and where they originated, while information from before 1900 is unreliable since medicine was not really.



Check out this link that shows a history of flu strains and the three flu pandemics from the last century.

and has 0 comments
While listening to my favourite songs on Pandora, I heard a song that I really enjoyed. The band was The Provenance, from Gothenburg, Sweden. and I immediately started looking for more on the Internet. Here is one of the best songs I've heard in a while, with a video that could have been way better. The music, though, is worth it.

Catching Scarlet in the Sun - The Provenance

They have a site, but not very updated and, since they just released their fourth album but only joined YouTube in October 2006, I guess they are not really Internet people. So let's us lend them a little hand, shall we?
Official Web Site - Actually, their site is dead, their domain for sale.
MySpace site - ugh, it seems that the band has been... well... disbanded. Their last blog entry says as much: "bye".
YouTube user site

and has 0 comments
Of course, sooner or later YouTube blocked this video. Let's try something else:



It seems there is a fashion of combining English and Japanese in popular music in Japan, but this is really ridiculous. Just check out the lyrics: "Not a Chinaman 'cause I ain't from China, man... I am Japan, man.". Damn that's funny :))

Here are the complete lyrics, translation included

and has 0 comments
Enough with this! Geeks are not supposed to move, even use their hands to push something so small as a mouse. Moving a mouse all day builds muscle and you know that is bad! So check out the OCZ Neural Impulse Actuator at work. A head band, a wire, no movement. Geeky! I want one of those!

In other words: those curly bracket things in SQL. What? curly brackets in SQL? Yes! Imagine that :)

The idea is that most database systems adhere to the ODBC standard, at least ODBC 1.0. That means that, when you communicated with a database, you can send so called ODBC escape sequences that are translated into the SQL engine native objects.

Quick example: SELECT {d '2007-03-15'} will work in all ODBC 1.0 compliant DBMSs, including Microsoft SQL Server, MySql, PostgreSQL, Oracle, etc. and select a date object from 15 of March 2007, no matter the server configured country or language.

Interested yet? You can read the ODBC Programmer's Reference for more details. Short story shorter, here are the working and interesting parts (to me) of the ODBC escape sequences:
select {d '2007-02-13' }
select {t '22:20:30' }
select {ts '2007-02-13 22:20:30' }
select {fn curdate()}
select {fn curtime()}
select {fn User()}
select {fn Database()}
select {fn week(getdate())}
select {fn quarter(getdate())}
select {fn monthname(getdate())}
select {fn dayname(getdate())}
select {fn curdate()}
select {fn dayofweek(getdate())}
select {fn dayofyear(getdate())}
select {guid '12345678-1234-1234-1234-123456789012'}

Ok, so you have the greatest control library ever made and Microsoft releases Asp.Net Ajax and none of them work anymore. What is one to do?

Eilon Lipton to the rescue! He writes a very good article about Ajax enabling your controls without linking to the System.Web.Extensions dll.

However, the article is a bit outdated. Here is a piece of code that solves the problems (at least for the latest version of Asp.Net Ajax):
Type scriptManagerType = Type.GetType("System.Web.UI.ScriptManager, System.Web.Extensions, Version=1.0.61025.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35", false);
 if (scriptManagerType != null)
 {
 RegisterClientScriptResourceMethod = scriptManagerType.GetMethod("RegisterClientScriptResource", new Type[] { typeof(Control), typeof(Type),typeof(string) });
 RegisterStartupScriptMethod = scriptManagerType.GetMethod("RegisterStartupScript", new Type[] { typeof(Control), typeof(Type), typeof(string), typeof(string), typeof(bool) });
 }


This is because the namespace has changed since the writing of Elion's article from Microsoft.Web.UI to System.Web.UI and there are two methods named RegisterClientScriptResource and two named RegisterStartupScript so you have to get the right one. Else you get the "Ambiguous match found" error.

There you have it!

The .NET validation framework has two parts, the client Javascript validation and the server validation. That means that the Javascript code needs a value to validate and the server validation needs a property to validate.

So, first step, you create your web user control by putting some controls in it. Then, you want to add a validator to the page to reference the newly created user control. And you get the error "Control '{0}' referenced by the ControlToValidate property of '{1}' cannot be validated.". Why? because every control to be validated needs to be decorated with the ValidationProperty attribute:
[ValidationProperty("Text")]
public partial class ucDate : System.Web.UI.UserControl

Adding the first line to the control tells the validation framework to use the Text property of the UserControl.

Next step, you run the page and you notice the javascript doesn't work. The client validation works on html controls, by looking (recursively) for a 'value' attribute. When one looks at the source code, though, there is no html control that has the id of the user control. It doesn't use a span or a div to encapsulate its controls. All the controls have the id to show they are children to the user control, but the actual user control does not appear in the html source. So what is there to do?

<div id='<%=ClientID %>'></div>

You put all the controls in the ascx file of the User Control into this div. There you go! The validation works!

There is one more quirk regarding web user controls that have more children that render an html object with a 'value' attribute. In that case, remember that the validation starts from the very top, in our case the div. One could build simple javascript functions on the onchange or onsubmit javascript events, for example, to add a value attribute to the div. Best way would be using the onsubmit event, but be careful that the validation sequence also runs on the onsubmit event.

TextBox2.Attributes["onchange"]="document.getElementById('"+ClientID+"').value=this.value";


On popular demand, here is a complete example codeThis is a control that holds two TextBox controls. The control will be validated both on server and client by the value of the second Textbox, the first will be ignored.


using System;
using System.Web.UI;

[ValidationProperty("SecondTextboxValue")]
public partial class vuc : UserControl
{
public string SecondTextboxValue
{
get { return tbValidated.Text; }
}

protected void Page_Load(object sender, EventArgs e)
{
string script =
string.Format(
@"var vuc=document.getElementById('{0}');
var tb=document.getElementById('{1}');
if (vuc&&tb) {{
tb.vuc=vuc;
tb.onchange=function() {{ this.vuc.value=this.value; }}
}}"
,
ClientID, tbValidated.ClientID);

ScriptManager.RegisterStartupScript(Page, Page.GetType(), UniqueID + "_submit", script, true);
}
}


The ascx looks like this:
<%@ Control Language="C#" AutoEventWireup="true" CodeFile="vuc.ascx.cs" Inherits="vuc" %>
<div id="<%=ClientID %>"> value="<%=tbValidated.Text%>"
<asp:TextBox ID="tbIgnored" runat="server"></asp:TextBox>
<asp:TextBox ID="tbValidated" runat="server"></asp:TextBox>
</div>


How it works:
  • The javascript validator will look for an html element with the same id as the user control. If it has a value attribute, it will be validated, else it will go to the next control in the hierarchy. If the containing div would not have a value attribute, then the validation would have occured on the first textbox value, as the first element that has a value attribute. That's why the value attribute will be set on textbox change and when first loading the page.
  • The server validation will work because of the user control property that exposes the Text value of the second textbox and the ValidationProperty attribute that decorates the code.