Today I published a raw version of a program that solves Pixelo puzzles, a Flash version of the game generally known as Nonogram, Picross, Hanjie, etc. I was absolutely fascinated by the game, not only by the concept, but also by the attention to detail that Tamaii lent to it. The program is called Pixelo Solver and you can find it on GitHub, complete with source code.


I liked working on this because it covers several concepts that I found interesting:
  • Responding to global key combinations
  • Getting a snapshot of the screen and finding an object in it
  • Parsing a visual image for digits in a certain format
  • The algorithm for solving the puzzle in a reasonable amount of time and memory (it was not trivial)

Quick how-to: get all the files from here and copy them into a folder, then run PixeloTools.GameSolver.exe and press Ctrl-Shift-P on your puzzle in the browser.

Usage:
  1. Start PixeloTools.GameSolver.exe
  2. Open the browser to the Pixelo game
  3. Start a puzzle
  4. Press Ctrl-Shift-P

However, if the list of numbers per row or column is too long, the automated parsing will not work. In that case, you may use the program offline, with a parameter specifying a file in the format:
5 1 10, 3 3 7, 2 5 4, 2 5 2, 2 1 1 1, 1 1 1 1 1, 1 1 3 1 1, 1 1 1 1 1, 1 1 1, 1 1, 1 1, 2 2 2 4, 2 1 1 1 2 4, 2 1 2 1, 7 4, 2 2 1 2 2 2, 2 1 1 1 1 1 1, 2 1 1 4 2, 1 3 4 2 1, 1 1
8 1 1, 5 2 2 1, 2 1 3, 1 1, 1 7 4 1, 3 5 1, 3 1 1 3, 4 3 4, 3 1 1 3, 3 5 1, 1 7 4 1, 1 2, 1 1 1, 2 2 2, 2 1 1 2, 2 1 2 2, 3 2 1, 3 1 1, 4 1 2 2, 9 2 3
where the first line describes the rows and the second line the columns. When the parsing fails, you still get something like this in the output of the program. You can just copy and paste the two lines into a file, edit it so that it matches the puzzle, then run:
start "" PixeloTools.GameSolver.exe myFile.txt

The file can also be in XML format, same as in the file the game loads. That's for Tamaii's benefit, mostly.

Let me know what you think in the comments below.

I was trying to do a simple thing: configure a daily rolling log for log4net, meaning that I wanted log files to be created daily, with the date as part of the file name. The log was already configured and working with a normal FileAppender, so all I had to do was find the correct configuration. There are several answers on the Internet regarding this. I immediately went to the trusty StackOverflow, read the first answers, lazily copy-pasted, and it seemed to work. But it did not. So be warned: the first answer on StackOverflow about this is wrong.

So, my logs had to be named something like Application_20150420.log. That means several things:
  • The name of the appender class has to be set to log4net.Appender.RollingFileAppender
  • The names of the log files need to start with Application_ and end in .log
  • The names of the log files need to contain the date in the correct format
  • The logger needs to know that the files need to be created daily

This is the working configuration:
<appender name="FileAppender" type="log4net.Appender.RollingFileAppender">
<filter type="log4net.Filter.LevelRangeFilter">
<acceptOnMatch value="true" />
<levelMin value="DEBUG" />
</filter>

<file type="log4net.Util.PatternString" value="c:\logfiles\Application_.log" />
<appendToFile value="true" />
<rollingStyle value="Date" />
<datePattern value="yyyyMMdd" />
<preserveLogFileNameExtension value="true"/>
<staticLogFileName value="false" />

<lockingModel type="log4net.Appender.FileAppender+MinimalLock"/>
<layout type="log4net.Layout.PatternLayout">
<conversionPattern value="%date %-5level %logger - %message%newline"/>
</layout>
</appender>

As you can see, it is a little counter-intuitive. You do not need to specify the date in the file name, it is added automatically, and you absolutely need to use preserveLogFileNameExtension, otherwise your files would look like Application_.log20150420.
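For reference, a minimal sketch of wiring this up from code, assuming the XML above is saved in a log4net.config file next to the executable (classic log4net API; names here are illustrative):

using System.IO;
using log4net;
using log4net.Config;

class Program
{
private static readonly ILog Log = LogManager.GetLogger(typeof(Program));

static void Main()
{
// load the appender configuration once, at startup
XmlConfigurator.Configure(new FileInfo("log4net.config"));
Log.Debug("This line ends up in a file like Application_20150420.log");
}
}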

I was trying to solve another problem: operations that create, alter or drop databases cannot be executed inside a .Net transaction. Therefore I created a new TransactionScope inside the main one, using the TransactionScopeOption.Suppress option, but then I started getting this weird exception: The transaction associated with the current connection has completed but has not been disposed. The transaction must be disposed before the connection can be used to execute SQL statements. But I did Complete and Dispose all my transaction scopes, so what was going on? Long story short, this really confusing message hides a simple problem: your transaction probably timed out. Create the transaction scope with a large TimeSpan as the Timeout and it will get through. If you can, use a try/catch/finally block in which you Dispose the transaction (just remember that a using construct is basically a try/finally block). In my case, I was conditionally creating this new TransactionScope, so I wasn't using using. The inner transaction would time out, and the failure would bleed into the outer transaction, where the confusing exception was thrown.
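Here is a minimal sketch of the pattern described above; the timeout value and the DDL placeholder are illustrative only:

using System;
using System.Transactions;

// a generous timeout on the outer scope, so long-running work doesn't
// silently abort the transaction and trigger the confusing exception
using (var outer = new TransactionScope(TransactionScopeOption.Required,
new TransactionOptions { Timeout = TimeSpan.FromMinutes(30) }))
{
// ... transactional work ...

// CREATE/ALTER/DROP DATABASE cannot run in the ambient transaction, so suppress it here
using (var inner = new TransactionScope(TransactionScopeOption.Suppress))
{
// ... execute the DDL command ...
inner.Complete();
}

outer.Complete();
} // the using construct guarantees Dispose, even on exceptions (it is a try/finally)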

I am relatively new to the entire NuGet ecosystem. What I expected is for things to just work. You know... Microsoft. However, the web of interdependencies seems to be too much even for them. The problems that appear when updating MVC versions, .NET Framework versions, etc. are as annoying as they are unclear. One example: I was trying to publish a project that worked perfectly on my system. I moved it to the server machine and weird things began to happen. The most annoying of them all is that the errors occur at runtime instead of at compile time. Such an error was "Could not load file or assembly System.Web.WebPages, Version=3.0.0.0...". The project reference for WebPages was 2.0.0.0. If I removed it and tried to add a reference, only versions 1.0.0.0 and 2.0.0.0 were available. Meanwhile, Razor was complaining that it didn't find the 3.0.0.0 version of WebPages.

Long story short: don't try to resolve only the library dependencies, but also the framework dependencies. In this case, System.Web.WebPages 3.0.0.0 is only available for the .NET framework 4.5.1. The project was configured as 4.5. Updating the MVC framework after the change to .NET 4.5.1 solved it.

Steps:
  • Change the project version to 4.5.1 (or whatever the newest usable .NET Framework version is)
  • Go to the NuGet Package Manager Console in Visual Studio
  • Run command Update-Package -reinstall Microsoft.AspNet.Mvc
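For reference, the target framework is declared in two places for a web project; a sketch with illustrative values:

<!-- in the .csproj -->
<TargetFrameworkVersion>v4.5.1</TargetFrameworkVersion>

<!-- in web.config -->
<system.web>
<compilation targetFramework="4.5.1" />
<httpRuntime targetFramework="4.5.1" />
</system.web>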

This, of course, is not a panacea for all problems, but just remember that the .NET framework is important in these scenarios.

The wonder of .Net is that most of the time we don't really have to care about stuff like memory allocation; the framework does everything for you. One especially annoying thing, though, is when you are using a basic data structure that is supposed to be efficient and you get stuff like OutOfMemoryException. One of these cases is List<T>, which in the background uses one big array. This means two things: certain modifying operations are slow, and it requires a contiguous memory space for the items. If your memory space gets too fragmented, there is not enough of it to allocate for hundreds of thousands of items, even if all you need in that space is a pointer per item. That is why you end up with out of memory exceptions: it's not that you don't have enough memory, it's that you don't have a big enough contiguous block of it.

As a solution I give you... the BucketList<T> class. It has a bucket size that defaults to 10000 and it implements a list of lists, each of which will hold at most that number of items. This way, operations that remove and add items only ever touch arrays of at most 10000 items, and there is no need for one big memory block. I implemented the IList<T> interface explicitly, so that you will never find it comfortable to use an instance as a BucketList, but rather as an IList<T>. This way you can replace the implementation of the interface with a normal List or whatever other form you like. Enjoy!

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Siderite.DataStructures
{
/// <summary>
/// IList implementation that doesn't require a contiguous memory space (an array)
/// </summary>
/// <typeparam name="T"></typeparam>
public class BucketList<T>:IList<T>
{
private int _bucketSize;
private List<List<T>> _list;
private int _count;

/// <summary>
/// Default constructor
/// </summary>
public BucketList():this(10000)
{
}

/// <summary>
/// Specify the bucket size (default 10000)
/// </summary>
/// <param name="bucketSize"></param>
public BucketList(int bucketSize)
{
_bucketSize = bucketSize;
// initialize the backing list here, so the class also works when constructed
// directly with a bucket size (it used to be initialized only in the
// parameterless constructor, which was a bug)
_list = new List<List<T>>();
}

/// <summary>
/// Create a bucket list from an IEnumerable
/// </summary>
/// <param name="enm"></param>
public BucketList(IEnumerable<T> enm):this()
{
var list = (IList<T>)this;
foreach (var itm in enm)
{
list.Add(itm);
}
}

/// <summary>
/// The item count
/// </summary>
public int Count
{
get
{
return ((IList<T>)this).Count;
}
}

#region IList<T> implementation

int IList<T>.IndexOf(T item)
{
var index = 0;
for (var i = 0; i < _list.Count; i++)
{
var idx = _list[i].IndexOf(item);
if (idx < 0)
{
index += _list[i].Count;
}
else
{
index += idx;
return index;
}
}
return -1;
}

void IList<T>.Insert(int index, T item)
{
var idx = 0;
for (var i = 0; i < _list.Count; i++)
{
var lst = _list[i];
if (index < idx + lst.Count)
{
lst.Insert(index - idx, item);
splitListIfTooLarge(i);
_count++;
return;
}
else
{
idx += lst.Count;
}
}
// allow inserting at the very end, like List<T> does
if (index == _count)
{
((ICollection<T>)this).Add(item);
return;
}
throw new IndexOutOfRangeException("index");
}

void IList<T>.RemoveAt(int index)
{
var idx = 0;
for (var i = 0; i < _list.Count; i++)
{
var lst = _list[i];
if (index < idx + lst.Count)
{
lst.RemoveAt(index - idx);
removeListIfEmpty(i);
_count--;
return;
}
else
{
idx += lst.Count;
}
}
throw new IndexOutOfRangeException("index");
}

T IList<T>.this[int index]
{
get
{
var idx = 0;
for (var i = 0; i < _list.Count; i++)
{
var lst = _list[i];
if (index < idx + lst.Count)
{
return lst[index - idx];
}
else
{
idx += lst.Count;
}
}
throw new IndexOutOfRangeException("index");
}
set
{
var idx = 0;
for (var i = 0; i < _list.Count; i++)
{
var lst = _list[i];
if (index < idx + lst.Count)
{
lst[index - idx] = value;
return; // fix: without this return, the setter would always throw after setting the value
}
else
{
idx += lst.Count;
}
}
throw new IndexOutOfRangeException("index");
}
}

void ICollection<T>.Add(T item)
{
List<T> lst;
int index;
if (_list.Count == 0)
{
lst = new List<T>();
_list.Add(lst);
index=0;
}
else
{
index=_list.Count - 1;
lst = _list[index];
}
lst.Add(item);
splitListIfTooLarge(index);
_count++;
}

void ICollection<T>.Clear()
{
_list.Clear();
_count = 0;
}

bool ICollection<T>.Contains(T item)
{
return ((IList<T>)this).IndexOf(item) > -1;
}

void ICollection<T>.CopyTo(T[] array, int arrayIndex)
{
var index = arrayIndex;
foreach (var lst in _list)
{
lst.CopyTo(array, index);
index += lst.Count;
}
}

int ICollection<T>.Count
{
get
{
return _count;
}
}

bool ICollection<T>.IsReadOnly
{
get { return false; }
}

bool ICollection<T>.Remove(T item)
{
for (var i = 0; i < _list.Count; i++)
{
var lst = _list[i];
if (lst.Remove(item))
{
_count--;
removeListIfEmpty(i);
return true;
}
}
return false;
}

IEnumerator<T> IEnumerable<T>.GetEnumerator()
{
foreach (var lst in _list)
{
foreach (var item in lst)
{
yield return item;
}
}
}

System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return ((IList<T>)this).GetEnumerator();
}

#endregion

private void splitListIfTooLarge(int listIndex)
{
var lst = _list[listIndex];
if (lst.Count > _bucketSize)
{
var newList = lst.GetRange(0, _bucketSize);
lst.RemoveRange(0, _bucketSize);
_list.Insert(listIndex, newList);
}
}

private void removeListIfEmpty(int listIndex)
{
var lst = _list[listIndex];
if (lst.Count ==0)
{
_list.RemoveAt(listIndex);
}
}
}
}
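A short usage sketch, under my assumption of typical use; as said above, consume the instance through IList<T>:

IList<int> list = new BucketList<int>();
for (var i = 0; i < 1000000; i++)
{
list.Add(i); // internally this only ever fills buckets of at most 10000 items
}
Console.WriteLine(list.Count); // 1000000
Console.WriteLine(list[123456]); // 123456
list.RemoveAt(0); // touches a single bucket, no huge array shift
Console.WriteLine(list[0]); // 1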

I have been working on a REST API lately and, while using Entity Framework or some other similar framework to abstract the database is certainly possible, I wanted to control every aspect of the implementation. I know, reinventing wheels, but this is how one learns. One of the most annoying bits was trying to translate some complex object from JSON to the (two-dimensional) relational database tables. This post will explore my attempts and the solutions I have found.

My first attempt was straightforward: just send all types as DataTables, with some extra property to define identity and parent entity. This relies on the Microsoft SQL Server mechanism that allows sending table variables to stored procedures. But this approach has several downsides. One of them is that in order to send a datatable to SQL you need... DataTables. As I have pointed out in several blog posts, the DataTable object is slow, and sometimes downright buggy. Even if I didn't care about performance that much, in order for SQL to receive the content of the DataTable one must create corresponding User Defined Types on the database side. Working with UDTs is very difficult for several reasons: you cannot alter a UDT (unless employing some SQL voodoo that changes system tables), you can only drop it and recreate it. This does not work if you use the UDT anywhere, so a lot of renaming needs to be done. Even if you automate the process, it's still very annoying. Then the UDT definition has to be an exact duplicate of the DataTable definition. Move some columns around and it fails. Debugging is also made difficult by the fact that the SQL profiler does not see the content of table variables when sending them to the server.
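For context, this is roughly what the DataTable approach looks like from the C# side; a sketch only, where the connection string, the dbo.ServerType UDT and the InsertServers procedure are hypothetical:

var table = new DataTable();
table.Columns.Add("Id", typeof(int));
table.Columns.Add("Url", typeof(string));
table.Rows.Add(1, "http://some.url");

using (var conn = new SqlConnection(connectionString)) // hypothetical connection string
using (var cmd = new SqlCommand("InsertServers", conn) { CommandType = CommandType.StoredProcedure })
{
var param = cmd.Parameters.AddWithValue("@Servers", table);
param.SqlDbType = SqlDbType.Structured; // marks it as a table-valued parameter
param.TypeName = "dbo.ServerType"; // the UDT that must mirror the DataTable exactly
conn.Open();
cmd.ExecuteNonQuery();
}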

Long story short, I was looking for alternatives and I found XML. Now, you might think that this leads to a simple, and maybe even obvious, solution. But it's not that easy. Imagine that you send a list of objects in an XML. Each object is represented by an XML element and each property by a child element. In order to get the value of a property you need to iterate through all the nodes, for each node find the properties, for each property find the one element that defines it, then get the attribute value or content of the property, all while making sure you select everything into a table. Not that easy at all.

The solution I found, which simplifies the SQL code (and hopefully brings some well needed performance to the table) is to serialize the objects in a way that makes the selection simple enough. Here is an example: I have a Configuration object with an Id and a Name that also has a property called Servers, containing Server objects having an Id and a Url. Here is an example of XML serialization from the DataContractSerializer:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration xmlns="http://schemas.datacontract.org/2004/07/SerializationTest" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Id>1</Id>
<Name>Test Config</Name>
<Servers>
<Server>
<Id>1</Id>
<Url>http://some.url</Url>
</Server>
<Server>
<Id>2</Id>
<Url>http://some.other.url</Url>
</Server>
</Servers>
</Configuration>

The SQL code to get the information from an XML variable with this content would look like this:

DECLARE @Xml XML='<?xml version="1.0" encoding="UTF-8"?>
<Configuration xmlns="http://schemas.datacontract.org/2004/07/SerializationTest" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Id>1</Id>
<Name>Test Config</Name>
<Servers>
<Server>
<Id>1</Id>
<Url>http://some.url</Url>
</Server>
<Server>
<Id>2</Id>
<Url>http://some.other.url</Url>
</Server>
</Servers>
</Configuration>'


;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT T.Item.value('(Id/text())[1]','INT') as Id,
T.Item.value('(Name/text())[1]','NVARCHAR(100)') as Name
FROM @Xml.nodes('//Configuration') as T(Item)

;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT T.Item.value('(Id/text())[1]','INT') as Id,
T.Item.value('(Url/text())[1]','NVARCHAR(100)') as Url
FROM @Xml.nodes('//Configuration/Servers/Server') as T(Item)


This works, but look at that code. In my case, the situation was worse: the object I was using was a wrapper which implemented IDictionary<string,object> and, even though it implemented ISerializable, both XmlSerializer and DataContractSerializer use the dictionary as their data, so in the end I get ugly key and value elements that are even harder to get to and, I suppose, more inefficient to parse. Therefore I found the solution in IXmlSerializable, (yet) another serialization interface, used exclusively by the XML serializer classes. If every simple value were saved as an attribute and every complex object as an element, this could be the SQL code:

DECLARE @Xml XML='<?xml version="1.0" encoding="UTF-8"?>
<Configuration xmlns="http://schemas.datacontract.org/2004/07/SerializationTest" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<Config Id="1" Name="Test Config">
<Servers>
<List>
<Server Id="1" Url="http://some.url" />
<Server Id="2" Url="http://some.other.url" />
</List>
</Servers>
</Config>
</Configuration>'


;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT T.Item.value('@Id','INT') as Id,
T.Item.value('@Name','NVARCHAR(100)') as Name
FROM @Xml.nodes('//Configuration/Config') as T(Item)

;WITH XMLNAMESPACES(DEFAULT 'http://schemas.datacontract.org/2004/07/SerializationTest')
SELECT T.Item.value('@Id','INT') as Id,
T.Item.value('@Url','NVARCHAR(100)') as Url
FROM @Xml.nodes('//Configuration/Config/Servers/List/Server') as T(Item)


Much easier to read and hopefully to parse.

I am not going to write here about the actual implementation of IXmlSerializable. There are plenty of tutorials on the Internet about that. It's not pretty, :) but not too difficult, either.

What was the purpose of this exercise? Now I can send a complex object to SQL in a single query, making inserts and updates simple, without requiring a call for each instance of each type of complex object. Now, is it fast? I have no idea. Certainly, if performance is needed, perhaps the UDT/DataTable approach is faster. However, you will have to define a type for each type that you send as a DataTable to a stored procedure. An alternative can be a binary serializer and a CLR SQL function that translates it into tables. However, in my project I need to easily implement very custom API methods and to control every aspect, including tracing and profiling the various SQL calls. I believe the customized IXmlSerializable/XML in SQL approach is a reasonable one.
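For completeness, the C# side of the XML approach is a single parameter; again a sketch, with SaveConfiguration and serializedXml as hypothetical names:

using (var conn = new SqlConnection(connectionString)) // hypothetical connection string
using (var cmd = new SqlCommand("SaveConfiguration", conn) { CommandType = CommandType.StoredProcedure })
{
// the whole object graph travels as one XML value
cmd.Parameters.Add(new SqlParameter("@Xml", SqlDbType.Xml) { Value = serializedXml });
conn.Open();
cmd.ExecuteNonQuery();
}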

As always, I hope this helps someone.

I am going to tell you about how I worked on an object that can wrap any other object, serialize it, deserialize it, send it to a database, all efficiently and generically. But first, let me start at the beginning: you need a way to efficiently set/get the values of properties in objects of different types. That's where TypeCache comes in. I am sure not everything is perfect with the class, but hopefully it will put you on the right track.

I will start with some of the code, the class that does everything. Something will be missing, though.
/// <summary>
/// It caches the property setters and getters for the public properties of a type
/// </summary>
public class TypeCache
{
private static ConcurrentDictionary<Type, TypeCache> _typeCacheDict = new ConcurrentDictionary<Type, TypeCache>();

public static TypeCache Get(Type type)
{
TypeCache cache;
if (!_typeCacheDict.TryGetValue(type, out cache))
{
cache = new TypeCache(type);
_typeCacheDict[type] = cache;
}
return cache;
}

private TypeCache(Type type)
{
Type = type;
Setters = new ConcurrentDictionary<string, Action<object, object>>();
Getters = new ConcurrentDictionary<string, Func<object, object>>();
var props = type.GetProperties(BindingFlags.Public | BindingFlags.Instance);
Properties = new ConcurrentDictionary<string, PropertyInfo>();
foreach (var prop in props)
{
if (prop.CanRead)
{
var objGetter = prop.GetValueGetter();
Getters[prop.Name] = objGetter.Compile();
}
if (prop.CanWrite)
{
var objSetter = prop.GetValueSetter();
Setters[prop.Name] = objSetter.Compile();
}
Properties[prop.Name] = prop;
}
}

public Type Type { get; private set; }
public ConcurrentDictionary<string, Action<object, object>> Setters { get; private set; }
public ConcurrentDictionary<string, Func<object, object>> Getters { get; private set; }
public ConcurrentDictionary<string, PropertyInfo> Properties { get; private set; }
}

public static class TypeCacheExtensions
{
/// <summary>
/// Set the value of a property by name
/// </summary>
/// <param name="cache"></param>
/// <param name="item"></param>
/// <param name="key"></param>
/// <param name="value"></param>
public static void Set<T>(this TypeCache cache, object item, string key, T value)
{
if (cache == null || item == null) return;
Action<object, object> setter;
if (!cache.Setters.TryGetValue(key, out setter)) return;
setter(item, (object)value);
}

/// <summary>
/// Get the value of a property by name
/// </summary>
/// <param name="cache"></param>
/// <param name="item"></param>
/// <param name="key"></param>
/// <returns></returns>
public static T Get<T>(this TypeCache cache, object item, string key)
{
if (cache == null || item == null) return default(T);
Func<object, object> getter;
if (!cache.Getters.TryGetValue(key, out getter)) return default(T);
return (T)getter(item);
}

/// <summary>
/// Set the value for a property to default by name
/// </summary>
/// <param name="cache"></param>
/// <param name="item"></param>
/// <param name="key"></param>
public static void Delete(this TypeCache cache, object item, string key)
{
if (cache == null || item == null) return;
Action<object, object> setter;
if (!cache.Setters.TryGetValue(key, out setter)) return;
var value = cache.Properties[key].PropertyType.GetDefaultValue();
setter(item, value);
}

/// <summary>
/// Set the values for all the public properties of a class to their default
/// </summary>
/// <param name="cache"></param>
/// <param name="item"></param>
public static void Clear(this TypeCache cache, object item)
{
if (cache == null || item == null) return;
Action<object, object> setter;
foreach (var pair in cache.Properties)
{
if (!cache.Setters.TryGetValue(pair.Key, out setter)) continue;
var value = pair.Value.PropertyType.GetDefaultValue();
setter(item, value);
}
}
}

(I used extension methods so that there would be no problems using a null value as the TypeCache.) This class would be used something like this:
class Program
{
static void Main(string[] args)
{
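// TestObject is assumed to be any class with a public int property named TestProperty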
var obj = new TestObject();
var cache = TypeCache.Get(obj.GetType());
Stopwatch sw = new Stopwatch();
sw.Start();
for (var i = 0; i < 100000; i++)
{
cache.Get<int>(obj, "TestProperty");
cache.Set<int>(obj, "TestProperty",i);
}
sw.Stop();
Console.WriteLine("Time: " + sw.Elapsed.TotalMilliseconds);
Console.ReadKey();
}
}

If you try to compile this, after adding all the required namespaces (System.Collections.Concurrent and System.Reflection), you will get an error, because you are missing the extension methods PropertyInfo.GetValueGetter and PropertyInfo.GetValueSetter. These should return Expressions that, once compiled, get or set the value of a property on an object. Here is the first attempt, using the normal PropertyInfo methods:
    public static class ReflectionExtensions
{
public static Expression<Func<object, object>> GetValueGetter(this PropertyInfo propertyInfo)
{
var getter=propertyInfo.GetGetMethod();
return (obj) => getter.Invoke(obj, new object[] { });
}

public static Expression<Action<object, object>> GetValueSetter(this PropertyInfo propertyInfo)
{
var setter = propertyInfo.GetSetMethod();
return (obj,val) => setter.Invoke(obj, new[] { val });
}

public static object GetDefaultValue(this Type type)
{
return type.IsValueType ? Activator.CreateInstance(type) : null;
}
}
Wonderful thing! GetGet and GetSet :) Basically I am returning expressions that use reflection to get the getter/setter and then execute them. How long does the test program above take, with all those gets and sets? 1000 milliseconds. Could we improve on that?

Before .Net 4.0 the only solution to do that would have been to emit IL and compile it inside the code. Actually, even in 4.0 it is the most efficient option. But given my incompetence in that direction, I will give you the (slightly) easier to understand solution. Here we sacrifice a little speed for the readability of code:
    public static class ReflectionExtensions
{
/// <summary>
/// Get the expression of a value getter for a property. Compile the expression to execute or combine it with other expressions.
/// </summary>
/// <param name="propertyInfo"></param>
/// <returns></returns>
public static Expression<Func<object, object>> GetValueGetter(this PropertyInfo propertyInfo)
{
var instance = Expression.Parameter(typeof(object), "i");
var convertObj = Expression.TypeAs(instance, propertyInfo.DeclaringType);
var property = Expression.Property(convertObj, propertyInfo);
var convert = Expression.Convert(property, typeof(object));
return (Expression<Func<object, object>>)Expression.Lambda(convert, instance);
}

/// <summary>
/// Get the expression of a value setter for a property. Compile the expression to execute or combine it with other expressions.
/// </summary>
/// <param name="propertyInfo"></param>
/// <returns></returns>
public static Expression<Action<object, object>> GetValueSetter(this PropertyInfo propertyInfo)
{
var instance = Expression.Parameter(typeof(object), "i");
var argument = Expression.Parameter(typeof(object), "a");
var convertObj = Expression.TypeAs(instance, propertyInfo.DeclaringType);
var convert = Expression.Convert(argument, propertyInfo.PropertyType);
var setterCall = Expression.Call(convertObj,propertyInfo.GetSetMethod(),convert);
return (Expression<Action<object, object>>)Expression.Lambda(setterCall, instance, argument);
}

/// <summary>
/// Get the default value of a type
/// </summary>
/// <param name="type"></param>
/// <returns></returns>
public static object GetDefaultValue(this Type type)
{
return type.IsValueType ? New.Instance(type) : null;
}
}

/// <summary>
/// Class used to get instances of a type
/// </summary>
public static class New
{
private static ConcurrentDictionary<Type, Func<object>> _dict = new ConcurrentDictionary<Type, Func<object>>();

public static object Instance(Type type)
{
Func<object> func;
if (!_dict.TryGetValue(type, out func))
{
func = Expression.Lambda<Func<object>>(Expression.New(type)).Compile();
_dict[type] = func;
}
return func();
}
}

The rest of the article will be about explaining what these extension methods are doing. How fast were they? 300ms! 3.3 times faster. How did we do that?

Well, first of all let's get the New class out of the way. In this simple version it just creates instances of a type that has a parameterless constructor. In the case that I had, using simple objects to transfer information, I only needed that. What it does is create a lambda expression that represents the parameterless constructor of a type, compile it and cache it. Even a simple thing like this is faster than Activator.CreateInstance(type). Not by much, about 10%, but still. Anyway, we are only using this to delete the value of a property (by setting it to its default value), which is not really in the scope of our test. However, being a simple expression build, it shows you the things to come.

Now for the GetValueGetter and GetValueSetter methods. I will read the GetValueGetter method to you in human terms. You can correlate it with the code quite easily. Basically it says this:
  • the expression has one parameter, of type object
  • we will attempt to safely convert it (as) into the declaring type (the object the property belongs to)
  • we will access a property of an object of the declaring type
  • which we then convert to object so that the resulting expression returns object not the specific type of the property
  • transform it all into a lambda expression and return it

The major difficulty, for me at least, was to grasp that there is no object (there is no spoon), but only a lambda expression that receives a parameter of type object and returns another object. The expression would be compiled and then applied on an actual instance.

And that's that. The object that does the serialization and transformation to SqlParameters and gets/sets the values of the wrapped object is more complicated and I will probably not write a blog entry about it, but think about the concept a little: an object that receives another object in the constructor and works like a dictionary. The keys and values of the dictionary are filled from the property names and values of the original object, while any change in the values of the dictionary is set back to the properties of the object. It makes it very easy to access objects generically, controlling what the accessors do, but without the need for PostSharp or T4 templating, in real time, programmatically. The task of serialization is taken over by the ISerializable interface, which also uses a type of dictionary, meaning the object can be passed around from web services to C# code and then to SQL via SqlParameters.
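To make the concept concrete, here is a bare-bones sketch of such a wrapper on top of the TypeCache above; the real object does much more (serialization, SqlParameters), this only shows the dictionary-like access:

public class ObjectWrapper
{
private readonly object _instance;
private readonly TypeCache _cache;

public ObjectWrapper(object instance)
{
_instance = instance;
_cache = TypeCache.Get(instance.GetType());
}

public IEnumerable<string> Keys { get { return _cache.Properties.Keys; } }

// reads go through the compiled getters, writes through the compiled setters
public object this[string key]
{
get { return _cache.Get<object>(_instance, key); }
set { _cache.Set(_instance, key, value); }
}
}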

Short post about how to use language detection capabilities in your application. I will be demonstrating NTextCat, which is a .NET port of text_cat, a Perl script, itself an implementation of a whitepaper published in 1994 called N-Gram-Based Text Categorization.

    There are four steps to language detection using NTextCat:
  • Reference NTextCat - the library is now available as a NuGet package as well
  • Instantiate an identifier factory (usually RankedLanguageIdentifierFactory)
  • Get an instance of a RankedLanguageIdentifier from the factory by loading a language XML file
  • Call the Identify method on your text and get a list of languages in decreasing order of the probability that the text is in that language

Here is a piece of code that does that using the core XML file published with the library. Remember to add the XML file to your project and set its Copy to Output Directory property.

public class LanguageProcessor
{
private RankedLanguageIdentifier _identifier;

public string IdentifyLanguage(string text)
{
if (_identifier == null)
{
var file = new FileInfo(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "LanguageModels/Core14.profile.xml"));
if (!file.Exists)
{
throw new FileNotFoundException("Could not find LanguageModels/Core14.profile.xml to detect the language");
}
using (var readStream = File.OpenRead(file.FullName))
{
var factory = new RankedLanguageIdentifierFactory();
_identifier = factory.Load(readStream);
}
}
var languages = _identifier.Identify(text);
var mostCertainLanguage = languages.FirstOrDefault();
if (mostCertainLanguage != null)
{
return mostCertainLanguage.Item1.Iso639_3;
}
return null;
}

}

There are a lot of language profile XML files, some built from Wikipedia, for example, handling 280+ languages, but for the casual purpose of finding non-English text in a list, the core one will suffice.
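A quick usage sketch for the class above; the return values are ISO 639-3 codes, so English should come back as "eng":

var processor = new LanguageProcessor();
Console.WriteLine(processor.IdentifyLanguage("This is clearly an English sentence.")); // expected: eng
Console.WriteLine(processor.IdentifyLanguage("Ceci est une phrase en français.")); // expected: fra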

It has been a while since I've blogged something technical. It's just that nothing I did seemed to be worthy of this majestic blog... Well, jokes aside, here is a post detailing my log4net JIRA appender.

log4net is a popular (if not the most popular) logging framework for .NET out there. Its strength lies in its configurability, the possibility to create custom loggers, custom appenders, custom filters, etc. I will be talking about a custom appender, a class that can be loaded by log4net to consume the logged lines and put them somewhere. For example to make an application that uses log4net to write the log to the console all you do is configure it to use the console appender. The JIRA appender takes the log output and creates issues in JIRA, notifying users afterwards. JIRA is a tracker for team planning. It is also very popular.

In order to create an appender, one references the log4net assembly (or NuGet package) and then creates a class that inherits from AppenderSkeleton. We could implement IAppender, but the skeleton class has most of what people want from an appender. The next step is to override the Append method and we are done. We don't want to create an issue with each logged line, though, so we will make it so that it creates the issue after a period of inactivity or when the logger closes. For that we use the CancellationTokenSource class to create delayed actions that we can cancel and recreate. We also need to override OnClose().

For the JIRA API I used a project called AnotherJiraRestClient, but I guess one can use anything out there. You will see that the notify functionality is not implemented, so we have to add it.

Here is the appender source code:
using AnotherJiraRestClient;
using log4net.Appender;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace JiraLog4netAppender
{
public class JiraAppender : AppenderSkeleton // most appender functionality is found in this abstract class
{
private List<string> _notifyUsers;

// properties of the appender, configurable in the log4net.config file as children elements of the appender
public string url { get; set; } // the url of the JIRA site
public string user { get; set; } // the JIRA user
public string password { get; set; } // the JIRA password
public string project { get; set; } // the JIRA project
public string notify // a comma separated list of JIRA users who will be notified of the issue's creation
{
get
{
return string.Join(", ", _notifyUsers);
}
set
{
_notifyUsers.Clear();
if (!string.IsNullOrWhiteSpace(value))
{
_notifyUsers.AddRange(value.Split(',').Select(s => s.Trim()).Where(s => !string.IsNullOrWhiteSpace(s)));
}
}
}

CancellationTokenSource _ts;
StringWriter _sw;
Task _task;
private JiraClient _jc;
private string _priorityId;
private string _issueTypeId;

private object _writerLock=new object();

public JiraAppender()
{
_notifyUsers = new List<string>();
_ts = new CancellationTokenSource(); // use this to cancel the Delay task
_sw = new StringWriter();
}

protected override void Append(log4net.Core.LoggingEvent loggingEvent)
{
lock (_writerLock) // the StringWriter is shared with writeToJira, so guard it here as well
{
this.Layout.Format(_sw, loggingEvent); // use the appender layout to format the log lines (warning: for this you need to override RequiresLayout and return true)
}
_ts.Cancel(); // cancel the task and create a new one. This means as long as the logger writes, the 5 second delay will be reset
_ts = new CancellationTokenSource();
_task = Task.Delay(5000, _ts.Token).ContinueWith(writeToJira, _ts.Token); // after 5 seconds (I guess you could make this configurable as well) create a JIRA issue
}

protected override bool RequiresLayout
{
get
{
return true;
}
}

/// <summary>
/// write to jira, either when 5 seconds of inactivity passed or the appender is closed.
/// </summary>
/// <param name="task"></param>
private void writeToJira(Task task)
{
string s;
lock (_writerLock) // maybe the method was already in progress when another one was called. We need to clear the StringWriter before we allow access to it again
{
s = _sw.ToString();
var sb = _sw.GetStringBuilder();
sb.Clear();
}
if (!string.IsNullOrWhiteSpace(s))
{
writeTextToJira(s);
}
}

private void writeTextToJira(string text)
{
ensureClientAndValues();
var summary = "Log: " + this.Name; // the issue summary
var labels = new List<string> // some labels
{
this.Name, this.GetType().Name
};
var issue = new AnotherJiraRestClient.JiraModel.CreateIssue(project, summary, text, _issueTypeId, _priorityId, labels); // create the issue with type Issue and priority Trivial
var basicIssue = _jc.CreateIssue(issue);
_jc.Notify(basicIssue, _notifyUsers, "JiraAppender created an issue", null, null); // notify users of the issue's creation
}

/// <summary>
/// Make sure we have a JiraClient and that we know the ids of the Issue type and the Trivial priority
/// </summary>
private void ensureClientAndValues()
{
if (_jc == null)
{
_jc = new JiraClient(new JiraAccount
{
ServerUrl = url,
User = user,
Password = password
});
}
if (_priorityId==null) {
var priority = _jc.GetPriorities().FirstOrDefault(p => p.name == "Trivial");
if (priority == null)
{
throw new Exception("A priority with the name 'Trivial' was not found");
}
_priorityId = priority.id;
}
if (_issueTypeId == null)
{
var meta = _jc.GetProjectMeta(project);
var issue = meta.issuetypes.FirstOrDefault(i => i.name == "Issue");
if (issue == null)
{
throw new Exception("An issue type with the name 'Issue' was not found");
}
_issueTypeId = issue.id;
}
}

protected override void OnClose() //clean what you can and write to jira if there is anything to write
{
_ts.Cancel();
writeToJira(null);
_sw.Dispose();
_ts.Dispose();
_task = null;
base.OnClose();
}

}
}

As I said, AnotherJiraRestClient does not implement the notify API call needed to inform the users of the creation of an issue, so we need to change the project a little bit. Perhaps when you are implementing this, you will find notify already there, with a different format, but just in case you don't:
  • add to JiraClient the following method:
    public void Notify(BasicIssue issue, IEnumerable<string> userList, string subject, string textBody, string htmlBody)
    {
    var request = new RestRequest()
    {
    Method = Method.POST,
    Resource = ResourceUrls.Notify(issue.id),
    RequestFormat = DataFormat.Json
    };
    request.AddBody(new NotifyData
    {
    subject = subject,
    textBody = textBody,
    htmlBody = htmlBody,
    to = new NotifyTo
    {
    users = userList.Select(u => new User
    {
    active = false,
    name = u
    }).ToList()
    }
    });
    var response = client.Execute(request);
    if (response.StatusCode != HttpStatusCode.NoContent)
    {
    throw new JiraApiException("Failed to notify users from issue with id=" + issue.id+"("+response.StatusCode+")");
    }
    }
  • add to ResourceUrls the following method:
    public static string Notify(string issueId)
    {
    return Url(string.Format("issue/{0}/notify", issueId));
    }
  • create the following classes in the JiraModel folder and namespace:
    public class NotifyData
    {
    public string subject { get; set; }
    public string textBody { get; set; }
    public string htmlBody { get; set; }
    public NotifyTo to { get; set; }
    }

    public class NotifyTo
    {
    public List<User> users { get; set; }
    }

    public class User
    {
    public string name { get; set; }
    public bool active { get; set; }
    }

Here is an example configuration. Have fun!
<log4net>
<appender name="JiraAppender" type="JiraLog4netAppender.JiraAppender, JiraLog4netAppender">
<url value="https://my.server.url/jira"/>
<user value="jirauser"/>
<password value="jirapassword"/>
<project value="jiraproject"/>
<notify value="siderite,someotheruser" />
<layout type="log4net.Layout.PatternLayout">
<conversionPattern value="%date [%thread] %-5level %logger [%property{NDC}] - %message%newline" />
</layout>
<filter type="log4net.Filter.LevelRangeFilter">
<levelMin value="WARN" />
<levelMax value="FATAL" />
</filter>
</appender>
<root>
<level value="DEBUG" />
<appender-ref ref="JiraAppender" />
</root>
</log4net>
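With this configuration in place, any code that logs a warning or worse will, after five seconds of quiet, end up as a JIRA issue. A minimal usage sketch:

var log = log4net.LogManager.GetLogger("MyLogger");
log.Error("Something went terribly wrong"); // passes the WARN-FATAL filter, eventually becomes a JIRA issue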

Sometimes, after making some software or another and feeling all proud that everything works, I get an annoying exception that the path of the files my program uses is too long, the horrible PathTooLongException. At first I thought it was a filesystem thing or an operating system thing, or even a .Net version thing, but it is not. The exception would appear in .Net 4.5 apps running on Windows 7 systems that used NTFS on modern hardware.

Lazy as any software dev, I googled for it and found a quick and dirty replacement for System.IO called Delimon that mitigated the issue. This library is trying to do most of what System.IO does, but using the so called extended-length paths, stuff that looks like \\?\C:\Windows, with external functions accessed directly from kernel32.dll. All good and nice, but the library is hardly perfect. It has bugs, it doesn't implement all of the System.IO functionality and feels wrong. Why would Microsoft, the great and mighty, allow such a ridiculous problem to persist in their core filesystem library?

And the answer is: because it is a core library. Probably they would have to make an enormous testing effort to change anything there. It is something out of a managed code developer's nightmare: unsafe access to system libraries, code that spans decades of work and a maze of internal fields, attributes, methods and properties that can be used only from inside the library itself. Not to mention all those people who decided to solve problems in core classes using reflection and stuff. My guess is that they probably want to replace file system usage with another API, like Windows.Storage, which, alas, is only used for Windows Phone.


In this blog post I will discuss the System.IO problems that relate to the total length of a path, what causes them and possible solutions (if any).


Let's start with the exception itself: PathTooLongException. Looking for usages of the exception in the mscorlib.dll assembly of .Net 4.0 we see some interesting things. First of all, there is a direct translation from Windows IO error code 206 to this exception, so that means that, in principle, there should be no managed code throwing this exception at all. The operating system should complain if there is a path length issue. But that is not the case in System.IO.

Most of the other usages of the exception come from the class PathHelper, a helper class used by the System.IO.Path class in a single method: NormalizePath. Wonderful method, that: internal static unsafe. PathHelper is like a multiple personality class, the active one being determined by the field useStackAlloc. If set to true, then it uses memory and speed optimized code, but assumes that the longest path will always be 260. That's a constant, it is not something read from the operating system. If set to false, the max path length is also provided as a parameter. Obviously, useStackAlloc is set to true in most situations. We will talk about NormalizePath a bit later.

The other usages of the PathTooLongException class come from two Directory classes: Directory and LongPathDirectory. If you instantly thought "Oh, God, I didn't know there was a LongPathDirectory class! We can use that and all problems disappear!", I have bad news for you. LongPathDirectory is an internal class. Other than that it seems to be a copy paste clone of Directory that uses Path.LongMaxPath instead of hardcoded constants (248 maximum directory name length, for example) or... Path.MaxPath. If you thought MaxPath and LongMaxPath are properties that can be set to fix long path problems, I have bad news for you: they are internal constants set to 260 and 32000, respectively. Who uses this LongPathDirectory class, though? The System.IO.IsolatedStorage namespace. We'll get back to this in a moment.

Back to Path.NormalizePath. It is a nightmare method that uses a lot of internal constants, system calls, convoluted code; it seems like someone deliberately tried to obfuscate its code. It's an internal method, of course, which makes no sense, as the functionality of path normalization would be useful in a lot of scenarios. Its first parameter is path, then fullCheck, which when true leads to extra character validation. The fourth parameter is expandShortPaths which calls the GetLongPathName function of kernel32.dll. The third parameter is more interesting, it specifies the maximum path length which is sent to PathHelper or makes local checks on the path length. But who uses this parameter?

And now we find a familiar pattern: there is a class (internal of course) called LongPath, which seems to be a clone of Path, only designed to work with long paths. Who uses LongPath? LongPathDirectory, LongPathFile and classes in the System.IO.IsolatedStorage namespace!


So, another idea becomes apparent. Can we use System.IO.IsolatedStorage to have a nice access to the file system? No we can't. For at least two reasons. First of all, the isolated storage paradigm is different from what we want to achieve, it doesn't access the raw file system, instead it isolates files in containers that are accessible to that machine, domain, application or user only. Second, even if we could get an "isolated" store that would represent the file system - which we can't, we would still have to contend with the string based way in which IsolatedStorage works. It is interesting to note that IsolatedStorage is pretty much deprecated by the Windows 8 Windows.Storage API, so forget about it. Yeah, so we have LongPathDirectory, LongPathFile and LongPath classes, but we can't really use them. Besides, what we want is something more akin to DirectoryInfo and FileInfo, which have no LongPath versions.

What can we do about it, then? One solution is to use Delimon. It has some bugs, but they can be avoided or fixed, either by the developers or by getting the source/decompiling the library and fixing the bugs yourself. A limited, but functional solution.
An interesting alternative is to use the libraries the BCL team published for long path access: LongPath, which seems to contain classes similar to the ones we find in mscorlib but whose latest release is from 2010, or Zeta long paths, which has a more recent release (2013) but is completely different, using the FileInfo and DirectoryInfo paradigm, too.

Of course, you can always make your own API.

Another solution is to be aware of the places where the length limitation appears and avoid them via other types of development, in other words a compilation of file system best practices that eventually turns into a new file system API.

Both solutions coalesce into using some other library instead of System.IO. That's horrible and I think a stain on .Net's honor!


So let's see where the exception is actually thrown.

I've made some tests. First of all, I used FAR Manager, a file manager, to create folders of various lengths. The longest one was 255 characters, before I got an exception. To my surprise, Windows Explorer could see it, but it could not open or copy/rename/delete it. I reduced the size of its name until the total size of the path was 260, then I could manipulate it in Windows Explorer. So there are external reasons for not creating paths this long, but we see that there are tools that can access files like that. Let's attempt to create some programmatically.

System.IO.Directory.CreateDirectory immediately fires the exception. DirectoryInfo has no problem instantiating with the long path as the parameter, but the Create method throws the same exception. Any attempt to create a folder of more than 248 characters, even if the total path was less than 260 characters, failed as well.

However, with reflection to the rescue, I could create paths as long as 32000 characters and folders with names as long as 255 characters using our friend LongPathDirectory:
var longPathDirectoryType = typeof(System.IO.Directory).Assembly.GetTypes().First(t=>t.Name=="LongPathDirectory");
var createDirectoryMethodInfo = longPathDirectoryType.GetMethod("CreateDirectory", System.Reflection.BindingFlags.Static | System.Reflection.BindingFlags.NonPublic);
createDirectoryMethodInfo.Invoke(null, new object[] { path });

What about files? FAR Manager threw the same errors if I tried to create a filename larger than 255 characters. Let's try to create the same programmatically.

File.Create threw the exception, as well as FileInfo.Create and the FileStream constructors.

So can we use the same method and use LongPathFile? No! Because LongPathFile doesn't have the creating and opening functionalities of File. Instead, FileStream has a constructor that specifies useLongPath. It is internal, of course, and used only by IsolatedStorageFileStream!

Code to create a file:
var fileStreamConstructorInfo = typeof(System.IO.FileStream).GetConstructor(System.Reflection.BindingFlags.NonPublic|System.Reflection.BindingFlags.Instance,null,
new Type[] {
typeof(string) /*path*/, typeof(System.IO.FileMode) /*mode*/, typeof(System.IO.FileAccess) /*access*/,
typeof(System.IO.FileShare) /*share*/, typeof(int) /*bufferSize*/, typeof(System.IO.FileOptions) /*options*/,
typeof(string) /*msgPath*/, typeof(bool) /*bFromProxy*/, typeof(bool) /*useLongPath*/, typeof(bool) /*checkHost*/},null);
var stream = (System.IO.Stream)fileStreamConstructorInfo.Invoke(new object[] {
path, System.IO.FileMode.Create, System.IO.FileAccess.Write,
System.IO.FileShare.None, 4096, System.IO.FileOptions.None,
System.IO.Path.GetFileName(path), false, true, false
});
Horrible, but it works. Again, no filenames bigger than 255 and the exception coming from the file system, as it should. Some info about the parameters: msgPath is the name of the file opened by the stream, if bFromProxy is true the stream doesn't try to demand some security permissions, checkHost... does nothing :) Probably someone wanted to add a piece of code there, but forgot about it.

Why did I use 4096 as the buffer size? Because that is the default value used by .Net when not specifying the value. Kind of low, right?

Now, this is some sort of midway alternative to using a third party file system library: you invoke through reflection code that is done by Microsoft and hidden for no good reason. I don't condone using it, unless you really need it. What I expect from the .Net framework is that it takes care of all this for me. It seems (as detailed in this blog post), that efforts are indeed made. A little late, I'd say, but still.

It just happens that I have two different projects that have the need of cluster analysis, applied in two different ways: one is for maps, where a large number of items needs to be displayed quickly, while the other implies finding clusters of news items, where the distance between them is determined by their content. The most used clustering algorithm and the first to be found by searching the web is k-means clustering. Its purpose is to categorize a list of items into k clusters, hence the name. Setting aside the use value of the algorithm for my purposes, the biggest problem I see is the complexity: in its general form it is at least O(n^2), and most of the time a lot higher. The net abounds with scientific papers investigating the k-means complexity and suggesting improvements, but they are highly mathematical and I didn't have the time to investigate further. So I just built my own algorithm. It is clearly fuzzy, imperfect, may even be wrong in some situations, but at least it is fast. I will certainly investigate this area more, maybe even try to understand the math behind it and analyse my results based on this research. When I do that I will update this post or write others. But until then, let me present my work so far.

The first problem I had was, as I said, complexity. For one million points on the map, any algorithm that takes into account the distance between any two items will have to make at least one trillion comparisons. So my solution was to limit the number of items by grouping them in a grid:
Step 1: find the min and max on each dimension (that means going through the entire item collection once, or knowing the map bounds beforehand)
Step 2: determine a number of cells that would be a bit more than what I need in the end. (that's a decision I have to take, no algorithmic complexity)
Example: for my map example I have only two dimensions: X and Y. I want to display an upper bound of 1000 clusters. Therefore I find the minimum and maximum X and Y and then split each dimension into 100 slots. That means I would cluster the items I have into 10000 cells.
Step 3: for each item, find its cell based on X,Y and add the item to the cell. This is done by simple division: cx = (X-minX)/qX, where qX = (maxX-minX)/100. (again, that means going once through the collection)
Step 4: find the smallest cell (the complexity is reduced now to working with cells)
Step 5: find its smallest neighbour (the complexity of this depends on the implementation)
Step 6: merge the two cells
Until the number of cells is larger than the desired number of clusters, repeat from Step 4.
In the end, the algorithm is O(n+p*log(p)), I guess, where p is the number of cells chosen at step 2.

Optimizations are the next issue.
  • How does one find the neighbours of a cell? On Step 3 we also create a list of neighbors for each new cluster by looking for a cluster that is at coordinates immediately above, below, left or right. When we merge two clusters, we get a cluster that is a neighbour to all the neighbours of the merged clusters.
  • How does one quickly get the cluster at a certain position? We create a dictionary that has the coordinates as the key. What about when we merge two clusters? Then the new cluster will be accessible by any of the original cluster keys (which implies that each cluster holds a list of keys, as well)
  • How does one find the smallest cell in the cluster list? After Step 3 we sort the cluster list by the number of items they contain and each time we perform a merge we find the first item larger than the merged result and we insert it in the list at that location, so that the list always remains sorted.
  • How do we easily find the first item larger than an item? By employing a divide-et-impera method of splitting the list in two at each step and choosing to look into one bucket based on the item count of the cluster at the middle position
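That last bullet is a plain binary search over the sorted cluster list. Schematically, it looks like the sketch below, written against the List<Cluster> of the original implementation (before the LinkedList update):

private static int findFirstLargerIndex(List<Cluster> sortedClusters, int count)
{
int lo = 0, hi = sortedClusters.Count; // hi is exclusive
while (lo < hi)
{
var mid = (lo + hi) / 2;
if (sortedClusters[mid].Count > count)
{
hi = mid; // a candidate; look for an even earlier one
}
else
{
lo = mid + 1; // everything up to mid is too small
}
}
return lo; // equals sortedClusters.Count when no larger cluster exists
}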

Before you use the code, note that there are specific scenarios where this type of clustering will look a bit off, like items arranged in a long line or in an empty polygon (the cluster will appear to be in its center). But I needed speed and I got it.

Enjoy!

Update: The performance of removing items from or inserting items into a List is very poor, so I created a LinkedList version that seems to be even faster. Here it is; the old List version is at the end.

/// <summary>
/// Generic x,y positioned item class
/// </summary>
public class Item
{
public double X { get; set; }
public double Y { get; set; }
}

public class ClusteringDataProcessor
{
/// <summary>
/// the squared root of the initial cell number (100*100 = 10000)
/// </summary>
private const int initialClustersSquareSide = 100;
/// <summary>
/// the desired number of resulting clusters
/// </summary>
private const int numberOfFinalClusters = 1000;
private static Random _rnd = new Random();

/// <summary>
/// In this implementation, the Cluster inherits from Item, so the result is a list of Item
/// In the case of one Item Clusters, we actually return the original Item
/// </summary>
/// <param name="list"></param>
/// <returns></returns>
public List<Item> Process(List<Item> list)
{
if (list.Count <= numberOfFinalClusters) return list;
// find bounds. If already known, this can be provided as parameters
double minY = double.MaxValue;
double minX = double.MaxValue;
double maxY = double.MinValue;
double maxX = double.MinValue;
foreach (var item in list)
{
var y = item.Y;
var x = item.X;
minY = Math.Min(minY, y);
maxY = Math.Max(maxY, y);
minX = Math.Min(minX, x);
maxX = Math.Max(maxX, x);
}
// the original list of clusters
var clusterArr = new List<Cluster>();

// the dictionary used to index clusters on their position
var clusterDict = new Dictionary<string, Cluster>();

// the unit for finding the cell position for the initial clusters
var qX = (maxX - minX) / initialClustersSquareSide;
var qY = (maxY - minY) / initialClustersSquareSide;

foreach (var item in list)
{
// compute cell coordinates (integer and the values used as keys in the dictionary)
var cx = Math.Min((int)((item.X - minX) / qX), initialClustersSquareSide - 1);
var cy = Math.Min((int)((item.Y - minY) / qY), initialClustersSquareSide - 1);
var key = getKey(cx, cy);
Cluster cluster;
// if the cluster for this position does not exist, create it
if (!clusterDict.TryGetValue(key, out cluster))
{
cluster = new Cluster
{
Keys = new List<string> { key },
X = minX + cx * qX + qX / 2,
Y = minY + cy * qY + qY / 2,
//Items = new List<Item>(),
Count = 0,
Neighbors = new List<string>()
};
// the neighbours of this cluster are the existing clusters that are below, above, left or right. If they exist, this cluster also is added to their neighbour list
var nkeys = new[] { getKey(cx - 1, cy), getKey(cx + 1, cy), getKey(cx, cy - 1), getKey(cx, cy + 1) }; // fix: the fourth neighbour is the cell above, not the cell below twice
for (var j = 0; j < 4; j++)
{
Cluster nc;
if (clusterDict.TryGetValue(nkeys[j], out nc))
{
cluster.Neighbors.Add(nkeys[j]);
nc.Neighbors.Add(key);
}
}
clusterDict[key] = cluster;
clusterArr.Add(cluster);
}
// add the item to the cluster (note that the commented lines hold the items in a list, while the current implementation only remember the number of items)
//cluster.Items.Add(item);
cluster.Item = item;
cluster.Count++;
// add the item position to the sums, so that we can compute the final position of the cluster at the Finalize stage without enumerating the items (or having to hold them in an Items list)
cluster.SumX += item.X;
cluster.SumY += item.Y;
}
// if the number of initial clusters is already no larger than the desired number of clusters, just return them finalized
if (clusterArr.Count <= numberOfFinalClusters)
{
return clusterArr.Select(c => c.Finalize()).ToList();
}

// sort the cluster list so we can efficiently find the smallest cluster
//clusterArr.Sort(new Comparison<Cluster>((c1, c2) => c1.Items.Count.CompareTo(c2.Items.Count)));
LinkedList<Cluster> clusterLinkedList = new LinkedList<Cluster>(clusterArr.OrderBy(c => c.Count));

// remember last merged cluster, as next merged clusters might have similar sizes
var lastCount = int.MaxValue;
LinkedListNode<Cluster> lastLinkedNode = null;

// do this until we get to the desired number of clusters
while (clusterLinkedList.Count > numberOfFinalClusters)
{
// we need to get the smallest (so first) cluster that has any neighbours
var cluster1 = clusterLinkedList.First(c => c.Neighbors.Any());
Cluster cluster2 = null;
// then find the smallest neighbour
var min = int.MaxValue;
foreach (var nkey in cluster1.Neighbors)
{
var n = clusterDict[nkey];
//var l = n.Items.Count;
var l = n.Count;
if (l < min)
{
min = l;
cluster2 = n;
}
}
// join the clusters
var keys = cluster1.Keys.Union(cluster2.Keys).ToList();
var cluster = new Cluster
{
Keys = keys,
// approximate cluster position, not needed
//X = (cluster1.X + cluster2.X) / 2,
//Y = (cluster1.Y + cluster2.Y) / 2,

// the result holds the count of both clusters
//Items = cluster1.Items.Union(cluster2.Items).ToList(),
Count = cluster1.Count + cluster2.Count,
// the neighbors are the union of both clusters' neighbours, excluding the merged clusters themselves
Neighbors = cluster1.Neighbors.Union(cluster2.Neighbors)
.Distinct()
.Except(keys)
.ToList(),
// compute the sums for the final position
SumX = cluster1.SumX + cluster2.SumX,
SumY = cluster1.SumY + cluster2.SumY
};
foreach (var key in keys)
{
clusterDict[key] = cluster;
}
// efficiently remove clusters since LinkedList removals are fast
clusterLinkedList.Remove(cluster1);
clusterLinkedList.Remove(cluster2);

// a little bit of magic to make the finding of the insertion point faster (LinkedLists go through the entire list to find an item)
// if the last merged cluster is smaller or equal to the new merged cluster, then start searching from it.
// this halves the insertion time, but I am sure there are even better implementations; I just didn't think it was that important
LinkedListNode<Cluster> start;
if (lastCount <= cluster.Count && lastLinkedNode.Value != cluster1 && lastLinkedNode.Value != cluster2)
{
start = lastLinkedNode;
}
else
{
start = clusterLinkedList.First;
}
var insertionPoint = nextOrDefault(clusterLinkedList, start, c => c.Count >= cluster.Count);
// remember last merged cluster
LinkedListNode<Cluster> link;
if (insertionPoint == null)
{
link = clusterLinkedList.AddLast(cluster);
}
else
{
link = clusterLinkedList.AddBefore(insertionPoint, cluster);
}
lastLinkedNode = link;
lastCount = cluster.Count;
}
return clusterLinkedList.Select(c => c.Finalize()).ToList();
}

private LinkedListNode<T> nextOrDefault<T>(LinkedList<T> list, LinkedListNode<T> start, Func<T, bool> condition)
{
while (start != null) // must also test the last node, or the sort order breaks
{
if (condition(start.Value)) return start;
start = start.Next;
}
return null;
}

private string getKey(int cx, int cy)
{
return cx + ":" + cy;
}

private class Cluster : Item
{
public Cluster()
{
SumX = 0;
SumY = 0;
}

public double SumX { get; set; }
public double SumY { get; set; }

public List<string> Keys { get; set; }

//public List<Item> Items { get; set; }
public int Count { get; set; }
public Item Item { get; set; }

public List<string> Neighbors { get; set; }


/// <summary>
/// the function that finalizes the computation of the values or even decides to return the only item in the cluster
/// </summary>
/// <returns></returns>
public Item Finalize()
{
//if (Items.Count == 1) return Items[0];
if (Count == 1) return Item;
/*Y = SumY / Items.Count;
X = SumX / Items.Count;
Count = Items.Count;*/
Y = SumY / Count;
X = SumX / Count;
return this;
}
}
}


While working on a small personal project, I decided to make a graphical tool that displayed a list of items in a Canvas. After making it work (for details go here), I decided to make the items animate when changing their position. In my mind it had to be a simple solution, akin to jQuery animate or something like that; it was not.

The final solution was to give up on a generic method and fall back on trusty attached properties. But if you are curious to see what else I tried and how ugly it got, read it here:
Click here to get ugly!

Well, long story short: attached properties. I created two attached properties, CanvasLeft and CanvasTop. When one of them changes, I animate the corresponding Canvas.Left or Canvas.Top property and, when the animation completes, I set the final value. Here is the code:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows;
using System.Windows.Controls;
using System.Windows.Media.Animation;

namespace Siderite.AttachedProperties
{
public static class UIElementProperties
{
public static readonly DependencyProperty CanvasLeftProperty = DependencyProperty.RegisterAttached("CanvasLeft", typeof(double), typeof(UIElementProperties), new FrameworkPropertyMetadata(
0.0,
FrameworkPropertyMetadataOptions.OverridesInheritanceBehavior,
CanvasLeftChanged));

[AttachedPropertyBrowsableForType(typeof(UIElement))]
public static double GetCanvasLeft(DependencyObject element)
{
if (element == null)
{
throw new ArgumentNullException("element");
}
return (double)element.GetValue(CanvasLeftProperty);
}

[DesignerSerializationVisibility(DesignerSerializationVisibility.Visible)]
public static void SetCanvasLeft(DependencyObject element, double value)
{
if (element == null)
{
throw new ArgumentNullException("element");
}
element.SetValue(CanvasLeftProperty, value);
}

private static void CanvasLeftChanged(DependencyObject d, DependencyPropertyChangedEventArgs e)
{
var sb = new Storyboard();
var oldVal = (double)e.OldValue;
if (double.IsNaN(oldVal)) oldVal = 0;
var newVal = (double)e.NewValue;
if (double.IsNaN(newVal)) newVal = oldVal;
var anim = new DoubleAnimation
{
From = oldVal,
To = newVal,
Duration = new Duration(TimeSpan.FromSeconds(1)),
FillBehavior = FillBehavior.Stop
};
Storyboard.SetTarget(anim, d);
Storyboard.SetTargetProperty(anim, new PropertyPath("(Canvas.Left)"));
sb.Children.Add(anim);
sb.Completed += (s, ev) =>
{
d.SetValue(Canvas.LeftProperty, newVal);
};
var fe = d as FrameworkElement;
if (fe != null)
{
sb.Begin(fe, HandoffBehavior.Compose);
return;
}
var fce = d as FrameworkContentElement;
if (fce != null)
{
sb.Begin(fce, HandoffBehavior.Compose);
return;
}
sb.Begin();
}


public static readonly DependencyProperty CanvasTopProperty = DependencyProperty.RegisterAttached("CanvasTop", typeof(double), typeof(UIElementProperties), new FrameworkPropertyMetadata(
0.0,
FrameworkPropertyMetadataOptions.OverridesInheritanceBehavior,
CanvasTopChanged));

[AttachedPropertyBrowsableForType(typeof(UIElement))]
public static double GetCanvasTop(DependencyObject element)
{
if (element == null)
{
throw new ArgumentNullException("element");
}
return (double)element.GetValue(CanvasTopProperty);
}

[DesignerSerializationVisibility(DesignerSerializationVisibility.Visible)]
public static void SetCanvasTop(DependencyObject element, double value)
{
if (element == null)
{
throw new ArgumentNullException("element");
}
element.SetValue(CanvasTopProperty, value);
}

private static void CanvasTopChanged(DependencyObject d, DependencyPropertyChangedEventArgs e)
{
var sb = new Storyboard();
var oldVal = (double)e.OldValue;
if (double.IsNaN(oldVal)) oldVal = 0;
var newVal = (double)e.NewValue;
if (double.IsNaN(newVal)) newVal = oldVal;
var anim = new DoubleAnimation
{
From = oldVal,
To = newVal,
Duration = new Duration(TimeSpan.FromSeconds(1)),
FillBehavior = FillBehavior.Stop
};
Storyboard.SetTarget(anim, d);
Storyboard.SetTargetProperty(anim, new PropertyPath("(Canvas.Top)"));
sb.Children.Add(anim);
sb.Completed += (s, ev) =>
{
d.SetValue(Canvas.TopProperty, newVal);
};
var fe = d as FrameworkElement;
if (fe != null)
{
sb.Begin(fe, HandoffBehavior.Compose);
return;
}
var fce = d as FrameworkContentElement;
if (fce != null)
{
sb.Begin(fce, HandoffBehavior.Compose);
return;
}
sb.Begin();
}
}
}

and this is how you would use them:

<ListView ItemsSource="{Binding KernelItems}" Name="lvKernelItems"
SelectedItem="{Binding SelectedItem,Mode=TwoWay}"
SelectionMode="Single"
>
<ListView.ItemsPanel>
<ItemsPanelTemplate>
<Canvas HorizontalAlignment="Stretch" VerticalAlignment="Stretch" Background="Black" />
</ItemsPanelTemplate>
</ListView.ItemsPanel>
<ListView.ItemContainerStyle>
<Style TargetType="{x:Type ListViewItem}">
<Setter Property="FocusVisualStyle" Value="{x:Null}"/>
<Setter Property="Foreground" Value="White"/>
<Setter Property="att:UIElementProperties.CanvasLeft" >
<Setter.Value>
<MultiBinding Converter="{StaticResource CoordinateConverter}">
<Binding Path="X"/>
<Binding Path="ActualWidth" ElementName="lvKernelItems"/>
<Binding Path="DataContext.Zoom" RelativeSource="{RelativeSource AncestorType={x:Type Window}}"/>
</MultiBinding>
</Setter.Value>
</Setter>
<Setter Property="att:UIElementProperties.CanvasTop" >
<Setter.Value>
<MultiBinding Converter="{StaticResource CoordinateConverter}">
<Binding Path="Y"/>
<Binding Path="ActualHeight" ElementName="lvKernelItems"/>
<Binding Path="DataContext.Zoom" RelativeSource="{RelativeSource AncestorType={x:Type Window}}"/>
</MultiBinding>
</Setter.Value>
</Setter>
<Style.Resources>
<SolidColorBrush x:Key="{x:Static SystemColors.HighlightBrushKey}" Color="Transparent" />
<SolidColorBrush x:Key="{x:Static SystemColors.ControlBrushKey}" Color="Transparent" />
<SolidColorBrush x:Key="{x:Static SystemColors.HighlightTextBrushKey}" Color="Black" />
<SolidColorBrush x:Key="{x:Static SystemColors.ControlTextBrushKey}" Color="Black" />
</Style.Resources>
<Style.Triggers>
<Trigger Property="IsSelected" Value="True">
<Setter Property="Foreground" Value="Cyan"/>
<Setter Property="Effect">
<Setter.Value>
<DropShadowEffect ShadowDepth="0" Color="White" Opacity="0.5" BlurRadius="10"/>
</Setter.Value>
</Setter>
<Setter Property="Canvas.ZIndex" Value="1000"/>
</Trigger>
</Style.Triggers>
</Style>
</ListView.ItemContainerStyle>
</ListView>

Hope it helps.

For a WPF project I wanted to create a graphical representation of a list of items. I computed some X,Y coordinates for each item and started changing the XAML of a ListView in order to reflect the position of each item on a Canvas. Well, it is just as easy as you imagine: change the ItemsPanel property to a Canvas and then style each item as whatever you want. The gotcha comes when trying to set the coordinates. The thing is that for each item in a ListView a container is generated, and the item template is displayed inside it. So here you have all your items displayed exactly as you want them, except the coordinates don't work, since what needs to be positioned on the Canvas are the generated containers, not the items. Here is the solution:
<Window.Resources>
<local:CoordinateConverter x:Key="CoordinateConverter"/>
<DataTemplate DataType="{x:Type vm:KernelItem}"> <!-- the template for the data items -->
<Grid>
<Ellipse Fill="{Binding Background}" Width="100" Height="100" Stroke="DarkGray" Name="ellipse"
ToolTip="{Binding Tooltip}"/>
<TextBlock Text="{Binding Text}" MaxWidth="75" MaxHeight="75"
HorizontalAlignment="Center" VerticalAlignment="Center"
TextAlignment="Center"
TextWrapping="Wrap"
ToolTip="{Binding Tooltip}" />
</Grid>
</DataTemplate>
</Window.Resources>
<ListView ItemsSource="{Binding KernelItems}" Name="lvKernelItems"
SelectedItem="{Binding SelectedItem,Mode=TwoWay}"
SelectionMode="Single"
>
<ListView.ItemsPanel>
<ItemsPanelTemplate>
<Canvas HorizontalAlignment="Stretch" VerticalAlignment="Stretch" Background="Black" />
</ItemsPanelTemplate>
</ListView.ItemsPanel>
<ListView.ItemContainerStyle>
<Style TargetType="{x:Type ListViewItem}">
<Setter Property="FocusVisualStyle" Value="{x:Null}"/><!-- no highlight of selected items -->
<Setter Property="Foreground" Value="White"/>
<Setter Property="(Canvas.Left)" >
<Setter.Value>
<MultiBinding Converter="{StaticResource CoordinateConverter}">
<Binding Path="X"/>
<Binding Path="ActualWidth" ElementName="lvKernelItems"/>
<Binding Path="DataContext.Zoom" RelativeSource="{RelativeSource AncestorType={x:Type Window}}"/>
</MultiBinding>
</Setter.Value>
</Setter>
<Setter Property="(Canvas.Top)" >
<Setter.Value>
<MultiBinding Converter="{StaticResource CoordinateConverter}">
<Binding Path="Y"/>
<Binding Path="ActualHeight" ElementName="lvKernelItems"/>
<Binding Path="DataContext.Zoom" RelativeSource="{RelativeSource AncestorType={x:Type Window}}"/>
</MultiBinding>
</Setter.Value>
</Setter>
<Style.Resources><!-- no highlight of selected items -->
<SolidColorBrush x:Key="{x:Static SystemColors.HighlightBrushKey}" Color="Transparent" />
<SolidColorBrush x:Key="{x:Static SystemColors.ControlBrushKey}" Color="Transparent" />
<SolidColorBrush x:Key="{x:Static SystemColors.HighlightTextBrushKey}" Color="Black" />
<SolidColorBrush x:Key="{x:Static SystemColors.ControlTextBrushKey}" Color="Black" />
</Style.Resources>
<Style.Triggers>
<Trigger Property="IsSelected" Value="True"><!-- custom selected item template -->
<Setter Property="Foreground" Value="Cyan"/>
<Setter Property="Effect">
<Setter.Value>
<DropShadowEffect ShadowDepth="0" Color="White" Opacity="0.5" BlurRadius="10"/>
</Setter.Value>
</Setter>
<Setter Property="Canvas.ZIndex" Value="1000"/>
</Trigger>
</Style.Triggers>
</Style>
</ListView.ItemContainerStyle>
</ListView>

As a bonus, you can see how to remove the default selection visuals: the ugly dotted focus rectangle and the highlight background.

I created a little piece of software which was supposed to get as much as possible of the content I was interested in and then display it in a browser. It was a WPF application, but I doubt that matters much, since the WebBrowser control there is still based on the same Internet Explorer COM control that is used in Windows Forms.

Anyway, the idea was simple: connect to the URL of the item I was selecting, display the page in the browser and, if the title of the loaded document was that of an error page (meaning the browser could not load it - I found no decent way to determine the response error code), display the HTML content that I had gathered earlier. These two feats were, of course, realized using the Navigate and NavigateToString methods - more or less hacked to look like valid MVVM for WPF :-P. Everything worked; then I went to the place where the Internet is a far away myth - my Italian residence - started the application and ...
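
For reference, the fallback logic was along these lines (a sketch with my own names; errorTitles and cachedHtml are placeholders, not a general API):

// after every navigation, inspect the loaded document's title;
// if it looks like an IE error page, show the locally cached HTML instead
wbMain.LoadCompleted += (sender, args) =>
{
    dynamic doc = wbMain.Document; // the COM HTMLDocument behind the IE-based control
    var title = doc == null ? null : (string)doc.title;
    // errorTitles: whatever titles IE uses for its error pages ("Navigation Canceled" etc.)
    if (title != null && errorTitles.Contains(title))
    {
        wbMain.NavigateToString(cachedHtml); // the content gathered earlier
    }
};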

FatalExecutionEngineError was detected. Message: The runtime has encountered a fatal error. The address of the error was at 0x64808539, on thread 0xf84. The error code is 0x80131623. This error may be a bug in the CLR or in the unsafe or non-verifiable portions of user code. Common sources of this bug include user marshaling errors for COM-interop or PInvoke, which may corrupt the stack. Ignore the hexadecimal numbers, I strongly doubt they are of much help.

This horrible-looking exception blew right through a try/catch block and I could find no easy way to get to the source of the problem. The InnerException wasn't too helpful either: System.ExecutionEngineException was unhandled, HResult=-2146233082, Message=Exception of type 'System.ExecutionEngineException' was thrown.

I did what every experienced professional does in these situations: googled for "WPF WebBrowser not working!!!". And I found this guy: Fatal Execution Error on browser ReadyState, who described a similar situation caused by interrogating the document's readyState property, which I was also doing! Amazingly, it worked. At first.

The second step of the operation was to take this software to the place where the Internet exists, but is guarded by huge trolls that live under the gateway - my Italian/European Commission workplace. Kaboom! The dreaded exception appeared again, even though I had configured my software to work with a firewall and an HTTP proxy. Meanwhile, the hard drive of my laptop failed and I had to reinstall everything, going from Windows 8 back to Windows 7. The same error now consistently appeared at home.

At the end of this chain of events, the other tool of the software professional - blind trial and error - prevailed. All I had to do was call NavigateToString via the WebBrowser control's Dispatcher, thus:

wbMain.Dispatcher.BeginInvoke(new Action(() =>
{
wbMain.NavigateToString(content);
}));


... and everyone lived happily ever after.

For a personal project of mine I needed to gather a lot of data and condense it into a newsletter. What I needed was to take information from selected blogs, google queries and various pages that I find and take only what was relevant into account. Great, I thought, I will make a software to help me do that. And now, proverbially, I have two problems.

The major issue was that, after gathering all the info I needed, I was stuck reading thousands of web pages to get to the information I wanted. I was practically spammed. The thing is that there aren't even that many stories; it's just the same content copied from news site to news site, changing only the basic structure of the text, maybe using other words or expanding and collapsing terms in and out of abbreviations, and sometimes just pasted exactly as it was in the source, only displayed in a different web page with a different template.

So the challenge was to compare two or more web pages for the semantic similarity of their stories. While there is such a thing as semantic text analysis, just google for semantic similarity and you will get mostly academic PDF white papers and software written in Python or some equally disgusting language used only in scientific circles. And while, true, I was intrigued and for a few days I entertained the idea of understanding all that and actually building a C# library up to the task, I did not have the time for it. Not to mention that the data file I was supposed to parse was growing day by day while I was dallying in arcane algorithms.

In conclusion I used a faster and more hackish way to the same end. Here is how I did it.



The first major hurdle was to clear the muck from the web page and get to the real information. A simple HTML node innerText would not do. I had to ignore not only HTML markup, but also such lovely things as menus, ads, sidebars with blog information, etc. Luckily there is already a project that does that, called Boilerpipe. And before you jump at me for linking to a Java project, there is also a C# port, which I had no difficulty downloading and compiling.

At the time of writing, the project would not compile because of its dependency on a Mono.Posix library. Fortunately the library was only used for two methods that were never called, so I just removed the reference and the methods and all was well.

So now I would mostly have the meaningful text of both web pages. I needed an algorithm to quickly determine their similarity. I skipped the semantic part of the problem altogether (trying to detect synonyms or doing lexical parsing) and resorted to String Kernels. Don't worry if you don't understand a lot of the Wikipedia page; I will explain how it works right away. My hypothesis was that even if they change some words, the basic structure of the text remains the same, so while I was really after pages with the same basic meaning, I could find them by looking for pages with the same text structure.

In order to do that I created, for each page, a dictionary with string keys and integer values. The keys are text n-grams from the page (every sequence of three consecutive characters that are digits or letters) and the values are the counts of those kernels in the Boilerpipe text. At first I also allowed spaces in the kernel character set, but it only complicated the analysis.
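
A minimal sketch of the kernel extraction as described (the method name is mine, and I lowercase the text here for robustness; one plausible reading of "digits and letters only" is to strip everything else before taking the 3-grams):

// build the kernel dictionary for one page: every 3-gram of letters and digits
// from the Boilerpipe-extracted text, mapped to its number of occurrences
private static Dictionary<string, int> getKernels(string text)
{
    // keep only letters and digits, so punctuation and spacing do not influence the kernels
    var chars = new string(text.ToLowerInvariant()
        .Where(char.IsLetterOrDigit)
        .ToArray());
    var kernels = new Dictionary<string, int>();
    for (var i = 0; i + 3 <= chars.Length; i++)
    {
        var key = chars.Substring(i, 3);
        int count;
        kernels.TryGetValue(key, out count);
        kernels[key] = count + 1;
    }
    return kernels;
}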

To compare a page to others, I take the keys in its kernel dictionary, look them up in the dictionaries of the other pages, then compute a distance out of the counts. And it worked! It's not always perfect, but sometimes I even get pages that have completely different text yet reference the same topic.
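
I didn't fix an exact formula for this first version, but a straightforward choice is the Euclidean distance over the counts, something like:

private static double distance(Dictionary<string, int> a, Dictionary<string, int> b)
{
    // sum of squared differences of kernel counts over the union of keys;
    // a kernel missing from a page simply counts as zero
    double sum = 0;
    foreach (var key in a.Keys.Union(b.Keys))
    {
        int ca, cb;
        a.TryGetValue(key, out ca);
        b.TryGetValue(key, out cb);
        var diff = ca - cb;
        sum += (double)diff * diff;
    }
    return Math.Sqrt(sum);
}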

You might want to know what made me use 3-grams and not words. The explanation comes mostly from what I read first when I started to look for a solution, but it also has some logic. If I had used words, then abbreviations would have changed the meaning of the text completely. Also, I did not know how many distinct words there would be in a few thousand web pages. Restricting the kernels to three letters or digits gave me an upper limit for the memory used: ignoring case, at most 36^3 = 46,656 distinct keys (26 letters plus 10 digits, in all combinations of three).


Conclusion: use the .NET port of Boilerpipe to extract the text from the HTML, create a kernel dictionary for each page, then compute the vector distance between the dictionaries.

I also found a method to compare the dictionaries better. I build a general kernel dictionary (over all documents at once) and define the commonality of a bit of text as the number of times it appears divided by the total count of kernels, multiplied by the number of documents in which it is found divided by the total number of documents. Then the difference between the kernel counts in two documents is computed by dividing the squared difference for each kernel by its commonality and adding the results up. It works much better like this. Another side effect of this method is that one can compute how "interesting" a document is, by adding up the counts of all its kernels divided by their commonality, then dividing that by the length of the text (the total count of kernels). The higher the number, the less common its content.
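
Put into code, the weighted comparison looks roughly like this (a sketch; the commonality dictionary is assumed to be precomputed once over the whole document set as described above):

// commonality per kernel, precomputed over all documents:
// (occurrences of the kernel / total kernel occurrences)
// * (documents containing the kernel / total number of documents)
private static double weightedDistance(
    Dictionary<string, int> a, Dictionary<string, int> b,
    Dictionary<string, double> commonality)
{
    double sum = 0;
    foreach (var key in a.Keys.Union(b.Keys))
    {
        int ca, cb;
        a.TryGetValue(key, out ca);
        b.TryGetValue(key, out cb);
        var diff = ca - cb;
        // rare kernels (low commonality) weigh more in the difference
        sum += (double)diff * diff / commonality[key];
    }
    return sum;
}

// how "interesting" a document is: the higher, the less common its content
private static double interestingness(
    Dictionary<string, int> kernels, Dictionary<string, double> commonality)
{
    var totalCount = kernels.Values.Sum();
    var sum = kernels.Sum(kv => kv.Value / commonality[kv.Key]);
    return sum / totalCount;
}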