Sparse arrays in JavaScript

Published Jun 15, 2025

Intro

Welcome to sparse arrays, a JavaScript concept that completely escaped me until now. Let's dig into it.

I wanted to create a game board, an array of 8 rows, themselves arrays of 8 elements so I unthinkingly did a const row=Array(8). I used a console.table on it and it was displayed in a stupid way, but I blamed it on Google devs and moved on. I didn't understand something was amiss until I did a .forEach on a row array that had only three pieces on it and it was executed just three times. What was going on?

Explanation

Well, when you use the Array(integer) constructor you get a sparse array. It has a length, it has items, but it is not a contiguous memory space with N slots for values. The same thing happens when you omit an element from an array declaration, like [1,,2] which is a sparse array different from [1, undefined, 2]. A normal array, in this context, is called a dense array.

In other words, sparse arrays work differently in for and forEach loops! It's almost like an empty object with a length property that has array methods working for it. Think about this code for example:

const arr = { length: 8 }
arr[1]='something';

// displays 8 lines, where only the second is 'something' and the rest are undefined
for (let i=0; i<arr.length; i++) console.log(arr[i]);

// simulated forEach - displays only 2 lines: something and 8
Object.values(arr).forEach(v=>console.log(v));

The simplest way to transform a sparse array into a dense array is to iterate it, like this: [...sparseArray]. You can create a dense array with Array.from( { length:8 } ), however new Array(8) or Array(8) will return a sparse array.

If you look in the docs, sparse arrays are often used with .fill, like this: Array(8).fill(undefined), which will return an array of 8 undefined elements.

Going deeper

So, wait... now there are two types of arrays, they function differently, they must be different classes, right? Actually, no! These are just concepts that apply to Array. Let's see some code:

const x = [1]; // dense array
x.length=10; // sparse array, all slots except the first are 'empty'.
x.fill(2); // dense array, all slots are filled with the value 2
x[1]=undefined; // dense array, second slot contains undefined
delete x[3]; // sparse array, fourth slot is 'empty'
// 1 in x == true;
// 3 in x == false;

What is this madness? This is something new introduced by nasty people who just want to make everything complicated, right? Actually, no. Arrays were always sparse, from the very beginning of the language in 1995. The concept itself was formalized in 2009 for ES5, though, where it was explicitly said that for..in and for..of and forEach, map, filter, etc, will skip empty slots.

Conclusion

JavaScript arrays are misnomers and different from the previous concept of array, used in languages like C or .NET, a fixed and contiguous memory space to store elements based on a positional index. JavaScript arrays are complex objects that allow insertion and removal of items, resizing and, yes, sparsity.

You can work in JavaScript for years and years and still get blindsighted by something like this...

Array.from - more powerful than you think

Published May 9, 2025

and has 0 comments

Array.from is a JavaScript function that until now, I've used exclusively to transform an iterable into an array. I had never considered the powerful factory functionality that it provides, because its declared signature is deceptively simple: Array.from(arrayLike, mapFn, thisArg). So you get something that's like an array (an interable, right?) then you map it (phaw, like I don't know how to map an array!) and a thisArg probably sets the this context for the function (like I don't know how to bind a function!)

And yet...

The first "array-like" parameter doesn't need to be an iterable, it can be an object with a length. Meaning that you can just do something a number of times.

Example for creating a range:

// Sequence generator function (commonly referred to as "range", cf. Python, Clojure, etc.)
const range = (start, stop, step) =>
  Array.from(
    { length: Math.ceil((stop - start) / step) },
    (_, i) => start + i * step,
  );

Getting a sample from an array of string:

const sample = (arr, count) =>
  Array.from(
    { length: count },
    ()=>arr[Math.floor(Math.random()*arr.length)]
  );

How about generating a random string?

const chars = 'abcdefghijklmnopqrstuvwxyz';
const randomString = (length) =>
  Array.from({ length }, () => chars[Math.floor(Math.random() * chars.length)]).join('');

// OR using the sample function above

const randomString = (length) =>
  sample('abcdefghijklmnopqrstuvwxyz',length).join('');

And if you don't want an array, but something "array-like", like a NodeList?

class MyClass {}

Array.from.call(MyClass, ["foo", "bar", "baz"]);
// MyClass {0: 'foo', 1: 'bar', 2: 'baz', length: 3}

I admit Array.from(a,b,c) is functionally equivalent to Array.from(a).map(b.bind(c)), but it's more efficient, as the array is constructed only once. The power of this function comes from its interpretation of the "array-like" first argument. It takes iterable objects and/or containing a length property, thus allowing a more concise, functional way of writing code and avoiding unnecessary loops.

Hope it helps!

Bypassing blocking Chrome certificate screens

Published Apr 28, 2025

Posted in
misc
programming
browser

and has 0 comments

Yep, it's is that easy, thanks to code added by the Chromium devs. You just make sure the focus is on the HTTPS error page, then type "thisisunsafe". A lot more details here: thisisunsafe - Bypassing chrome security warnings.

Is it a good idea? Probably not. Will it be remove by Google devs some time in the future? Probably yes. But sometimes you just need to access that site and don't care about other stuff.

Hope it helps!

Bookmark Surfer Daedalus (formerly known as Bookmark Explorer)

Published Mar 9, 2025

and has 0 comments

A while ago I wrote a browser extension called Bookmark Explorer that no one used because it had such a banal name, but it was pretty cool. I just let it die when Google took it down for being a manifest V2 extension. I plan to modernize it and make it work for modern browsers. It is now also renamed as...

Bookmark Surfer Daedalus!

OK, it's still a silly name, but you know how naming is the hardest part, right?

This blog post will become the official web page for the extension.

But what does it do? Well, it allows to easily navigate in bookmark folders. Let's say you have one of those folders with hundreds of links or you, like me, open a hundred YouTube videos that you plan to eventually watch (yeah, right!) and your main memory consumption is keeping those tabs in the background. Now all you have to do is just put all of these into bookmark folders and open the first one. From it, you can quickly go forward or backward with either mouse or key combinations. The next page in the folder will be preloaded while you read the current one, so that you switch faster.

Features:

right click on a link and add it to the Read Later folder without having to open it (the extension opens it in the background to get the final URL and the title of the page, then closes it immediately)
click on extension button to get a popup with arrows forward/backward to navigate in whichever folder the current page is bookmarked in. You also get a button to move the current page to the end of the folder, so as to read it later than the others.
use Ctrl-Shift-K/L for backward/forward navigation. You can also change the key shortcuts in the browser.
define patterns that determine what makes a URL unique. For example some web sites have different pages with the exact same URL, but a parameter changed, like YouTube or some blog platforms. Maybe even hashes. You can define the pattern for those domains. The normal pattern is just the domain and the path, ignoring query parameters and hashes.
manage the bookmarks in the current folder. You can select them individually, delete them, move them to the end or the start and so on.
see the bookmarks that you deleted - whether with the extension or some other mechanism - and choose which one to permanently remove or restore to their original location.
notifications of pages that you bookmarked multiple times
you can export the URLs of the selected items in an entire folder

This extension is essential for people who read a lot on the Internet, like researchers or simply people who love information.

Data access in code, using repositories, even with ORMs

Published Dec 12, 2024

and has 0 comments

Intro

In the .NET world one of the most used method of accessing databases is with Entity Framework (EF), an Object Relational Mapper (ORM) that is tightly integrated with the language syntax. Using the Language Integrated Queries (LINQ) native to .NET languages, it makes data access feel like working with normal .NET collections, without much knowledge of SQL. This has its benefits and drawbacks that I will try not to rant about here. But one of the issues that it consistently creates is confusion regarding the structure of the software project, levels of abstractions and ultimately unit testing.

This post will try to explain why the repository abstraction is ALWAYS useful. Note that many people use repository as a term for abstracted data access, while there is also a repository software pattern which relates to similar things, but it's not the same thing. In here, I will call a repository a series of interfaces abstracting the implementation details of data access and ignore the design pattern completely.

History

Feel free to skip this if you are aware of it, but I have to first address how we got to the idea of repositories to begin with.

In prehistory, code was just written as is, with no structure, everything in, doing what you wanted it to do or at least hoping to. There was no automated testing, just manual hacking and testing until it worked. Each application was written in whatever was on hand, with concerns about hardware requirements more important than code structure, reuse or readability. That's what killed the dinosaurs! True fact.

Slowly, patterns started to emerge. For business applications in particular, there was this obvious separation of the business code, the persistence of data and the user interface. These were called layers and soon separated into different projects, not only because they covered different concerns, but also because the skills necessary to build them were especially different. UI design is very different from code logic work and very different from SQL or whatever language or system was used to persist data.

Therefore, the interaction between the business and the data layer was done by abstracting it into interfaces and models. As a business class, you wouldn't ask for the list of entries in a table, you would require a filtered list of complex objects. It would be the responsibility of the data layer to access whatever was persisted and map it to something understandable to business. These abstractions started to be called repositories.

On the lower layers of data access, patterns like CRUD quickly took over: you defined structured persistence containers like tables and you would create, read, update or delete records. In code, this kind of logic will get abstracted to collections, like List, Dictionary or Array. So there was also a current of opinion that repositories should behave like collections, maybe even be generic enough to not have other methods than the actual create, read, update and delete.

However, I strongly disagree. As abstractions of data access from business, they should be as far removed from the patterns for data access as possible, instead being modelled based on the business requirements. Here is where the mindset of Entity Framework in particular, but a lot of other ORMs, started clashing with the original idea of repository, culminating with calls to never use repositories with EF, calling that an antipattern.

More layers

A lot of confusion is generated by parent-child relationships between models. Like a Department entity with People in it. Should a department repository return a model containing people? Maybe not. So how about we separate repositories into departments (without people) and people, then have a separate abstraction to map then to business models?

The confusion actually increases when we take the business layer and separate it into sublayers. For example, what most people call a business service is an abstraction over applying specific business logic only to a specific type of business model. Let's say your app works with people, so you have a model called Person. The class to handle people will be a PeopleService, which will get business models from the persistence layer via a PeopleRepository, but also do other things, including a mapping between data models and business models or specific work the relates only to people, like calculating their salaries. However, most business logic uses multiple types of models, so services end up being mapping wrappers over repositories, with little extra responsibility.

Now imagine that you are using EF to access the data. You already have to declare a DbContext class that contains collections of entities that you map to SQL tables. You have LINQ to iterate, filter and map them, which is efficiently converted into SQL commands in the background and give you what you need, complete with hierarchical parent-child structures. That conversion also takes care of mapping of internal business data types, like specific enums or weird data structures. So why would you even need repositories, maybe even services?

I believe that while more layers of abstraction may seem like pointless overhead, they increase the human understanding of the project and improve the speed and quality of change. There is a balance, obviously, I've seen systems architected with the apparent requirement that all software design patterns be used everywhere. Abstraction is only useful if it improves code readability and separation of concerns.

Reason

One of the contexts where EF becomes cumbersome is unit testing. DbContext is a complicated system, with a lot of dependencies that one would have to manually mock with great effort. Therefore Microsoft came with an idea: in memory database providers. So in order to test anything, you just use an in memory database and be done with it.

Note that on Microsoft pages this method of testing is now marked with "not recommended". Also note that even in those examples, EF is abstracted by repositories.

While in memory database tests work, they add several issues that are not easy to address:

setting up an in memory DbContext requires all of the dependencies to existing entities
setting up and starting the memory database for each test is slow
in order to get valid database output you need to set up a lot more than what you want to atomically test

Therefore, what ends up happening is that people set up everything in the database within a "helper" method, then create tests that start with this inscrutable and complex method to test even the smallest functionality. Any code that contains EF code will be untestable without this setup.

So one reason to use repositories is to move the testing abstraction above DbContext. Now you don't need a database at all, just a repository mock. Then test your repo itself in integration tests using a real database. The in memory database is very close to a real database, but it is slightly different, too.

Another reason, which I admit I've rarely seen be of actual value in real life, is that you might want to change the way you access the data. Maybe you want to change to NoSql, or some memory distributed cache system. Or, which is much more likely, you started with a database structure, perhaps a monolithic database, and now you want to refactor it into multiple databases with different table structures. Let me tell you right off the bat that this will be IMPOSSIBLE without repositories.

And specific to Entity Framework, the entities that you get are active records, mapped to the database. You make a change in one and save the changes for another and you suddenly get the first entity updated in the db, too. Or maybe you don't, because you didn't include something, or the context has changed.

The proponents of EF always hype the tracking of entities as a very positive thing. Let's say you get an entity from the database, you then do some business, then you update the entity and save it. With a repo you would get the data, then do business, then get the data again in order to perform a little update. EF would keep it in memory, know it wasn't updated before your change, so it would never read it twice. That's true. They are describing a memory cache for the database that is somehow aware of database changes and keeps track of everything you handle from the database, unless instructed otherwise, bidirectionally maps database entries to complex C# entities and tracks changes back and forth, while being deeply embedded in the business code. Personally, I believe this plethora of responsibilities and lack of separation of concerns is a lot more damaging than any performance gained by using it. Besides, with some initial effort, all that functionality can still be abstracted in a repository, or maybe even another layer of memory caching for a repository, while keeping clear borders between business, caching and data access.

In fact, the actual difficulty in all of this is determining the borders between systems that should have separate concerns. For example, one can gain a lot of performance by moving filtering logic to stored procedures in the database, but that loses testability and readability of the algorithm used. The opposite, moving all logic to code, using EF or some other mechanism, is less performant and sometimes unfeasible. Or where is the point where data entities become business entities (see the example above with Department and Person)?

Perhaps the best strategy is to first define these borders, then decide on which technology and design are going to fit into that.

My conclusion

I believe that service and repository abstractions should always be used, even if the repository is using Entity Framework or other ORM underneath. It all boils down to separation of concerns. I would never consider Entity Framework a useful software abstraction since it comes with so much baggage, therefore a repository much be used to abstract it in code. EF is a useful abstraction, but for database access, not in software.

My philosophy of software writing is that you start with application requirements, you create components for those requirements and abstract any lower level functionality with interfaces. You then repeat the process at the next level, always making sure the code is readable and it doesn't require understanding of the components using or the ones used at the current level. If that's not the case, you've separated concerns badly. Therefore, as no business application ever had requirements to use a specific database or ORM, the data layer abstraction should hide all knowledge of those.

What does business want? A filtered list of people? var people = service.GetFilteredListOfPeople(filter); nothing less, nothing more. and the service method would just do return mapPeople(repo.GetFilteredListOfPeople(mappedFilter)); again nothing less or more. How the repo gets the people, saves the people or does anything else is not the concern of the service. You want caching, then implement some caching mechanism that implements IPeopleRepository and has a dependency on IPeopleRepository. You want mapping, implement the correct IMapper interfaces. And so on.

I hope I wasn't overly verbose in this article. I specifically kept code examples out of it, since this is more of a conceptual issue, not a software one. Entity Framework may be the target of most of my complaints here, but this applies to any system that magically helps you in small things, but breaks the important ones.

Hope that helps!

MockManager in unit tests - a builder pattern used for mocks

Published Dec 12, 2024

and has 0 comments

A few years ago I wrote about this, but in less detail. Here is a more refined version of the same idea.

Intro

Unit tests are both boon and bane to developers. They allow quick testing of functionality, readable examples of use, fast experimentation of scenarios for just the components involved. But they also can become messy, need maintenance and update with every code change and, when done lazily, can't hide bugs rather than reveal them.

I think the reason unit testing is so difficult is because it's associated with testing, something other than code writing, and also that unit tests are written in a way opposite to most other code we write.

In this post I will give you a simple pattern of writing unit tests that will enhance all the benefits, while eliminating most of the cognitive dissonance with normal code. Unit tests will remain readable and flexible, while reducing duplicate code and adding no extra dependencies.

How to unit test

But first, let's define a good unit test suite.

To properly test a class, it has to be written in a certain way. In this post we will cover classes using constructor injection for dependencies, which is my recommended way of doing dependency injection.

Then, in order to test it, we need to:

cover positive scenarios - when the class does what it's supposed to do, with various combinations of setup and input parameters to cover the whole functionality
cover negative scenarios - when the class fails in the correct way when the setup or input parameters are wrong
mock all external dependencies
keep all of the test setup, action and assertion in the same test (what is normally called the Arrange-Act-Assert structure)

But that's easier said than done, because it also implies:

setting up the same dependencies for every test, thus copying and pasting a lot of code
setting up very similar scenarios, with just one change between two tests, again repeating a lot of code
generalizing and encapsulating nothing, which is what a dev would normally do in all of their code
writing a lot of negative cases for few positive cases, which feels like having more testing code than functional code
having to update all of these tests for every change to the tested class

Who loves that?

Solution

The solution is to use the builder software pattern to create fluid, flexible and readable tests in the Arrange-Act-Assert structure, while encapsulating setup code in a class complementing the unit test suite for a specific service. I call this the MockManager pattern.

Let's start with a simple example:

// the tested class
public class Calculator
{
    private readonly ITokenParser tokenParser;
    private readonly IMathOperationFactory operationFactory;
    private readonly ICache cache;
    private readonly ILogger logger;

    public Calculator(
        ITokenParser tokenParser,
        IMathOperationFactory operationFactory,
        ICache cache,
        ILogger logger)
    {
        this.tokenParser = tokenParser;
        this.operationFactory = operationFactory;
        this.cache = cache;
        this.logger = logger;
    }

    public int Calculate(string input)
    {
        var result = cache.Get(input);
        if (result.HasValue)
        {
            logger.LogInformation("from cache");
            return result.Value;
        }
        var tokens = tokenParser.Parse(input);
        IOperation operation = null;
        foreach(var token in tokens)
        {
            if (operation is null)
            {
                operation = operationFactory.GetOperation(token.OperationType);
                continue;
            }
            if (result is null)
            {
                result = token.Value;
                continue;
            }
            else
            {
                if (result is null)
                {
                    throw new InvalidOperationException("Could not calculate result");
                }
                result = operation.Execute(result.Value, token.Value);
                operation = null;
            }
        }
        cache.Set(input, result.Value);
        logger.LogInformation("from operation");
        return result.Value;
    }
}

This is a calculator, as is tradition. It receives a string and returns an integer value. It also caches the result for a specific input, and logs some stuff. The actual operations are being abstracted by IMathOperationFactory and the input string is translated into tokens by an ITokenParser. Don't worry, this is not a real class, just an example. Let's look at a "traditional" test:

[TestMethod]
public void Calculate_AdditionWorks()
{
    // Arrange
    var tokenParserMock = new Mock<ITokenParser>();
    tokenParserMock
        .Setup(m => m.Parse(It.IsAny<string>()))
        .Returns(
            new List<CalculatorToken> {
                CalculatorToken.Addition, CalculatorToken.From(1), CalculatorToken.From(1)
            }
        );

    var mathOperationFactoryMock = new Mock<IMathOperationFactory>();

    var operationMock = new Mock<IOperation>();
    operationMock
        .Setup(m => m.Execute(1, 1))
        .Returns(2);

    mathOperationFactoryMock
        .Setup(m => m.GetOperation(OperationType.Add))
        .Returns(operationMock.Object);

    var cacheMock = new Mock<ICache>();
    var loggerMock = new Mock<ILogger>();

    var service = new Calculator(
        tokenParserMock.Object,
        mathOperationFactoryMock.Object,
        cacheMock.Object,
        loggerMock.Object);

    // Act
    service.Calculate("");

    //Assert
    mathOperationFactoryMock
        .Verify(m => m.GetOperation(OperationType.Add), Times.Once);
    operationMock
        .Verify(m => m.Execute(1, 1), Times.Once);
}

Let's unpack it a little. We had to declare a mock for every constructor dependency, even if we don't actually care about the logger or the cache, for example. We also had to set up a mock method that returns another mock, in the case of the operation factory.

In this particular test we wrote mostly setup, one line of Act and two lines of Assert. Moreover, if we want to test how the cache works inside the class we would have to copy paste the entire thing and just change the way we setup the cache mock.

And there are the negative tests to consider. I've seen many a negative test doing something like: "setup just what is supposed to fail. test that it fails", which introduces a lot of problems, mainly because it might fail for completely different reasons and most of the time these tests are following the internal implementation of the class rather than its requirements. A proper negative test is actually a fully positive test with just one wrong condition. Not the case here, for simplicity.

So, without further ado, here is the same test, but with a MockManager:

[TestMethod]
public void Calculate_AdditionWorks_MockManager()
{
    // Arrange
    var mockManager = new CalculatorMockManager()
        .WithParsedTokens(new List<CalculatorToken> {
            CalculatorToken.Addition, CalculatorToken.From(1), CalculatorToken.From(1)
        })
        .WithOperation(OperationType.Add, 1, 1, 2);

    var service = mockManager.GetService();

    // Act
    service.Calculate("");

    //Assert
    mockManager
        .VerifyOperationExecute(OperationType.Add, 1, 1, Times.Once);
}

Unpacking, there is no mention of cache or logger, because we don't need any setup there. Everything is packed and readable. Copy pasting this and changing a few parameters or some lines is no longer ugly. There are three methods executed in Arrange, one in Act and one in Assert. Only the nitty gritty mocking details are abstracted away: there is no mention of the Moq framework here. In fact, this test would look the same regardless of the mocking framework one decides to use.

Let's take a look at the MockManager class. Now this will appear complicated, but remember that we only write this once and use it many times. The whole complexity of the class is there to make unit tests readable by humans, easily to understand, update and maintain.

public class CalculatorMockManager
{
    private readonly Dictionary<OperationType,Mock<IOperation>> operationMocks = new();

    public Mock<ITokenParser> TokenParserMock { get; } = new();
    public Mock<IMathOperationFactory> MathOperationFactoryMock { get; } = new();
    public Mock<ICache> CacheMock { get; } = new();
    public Mock<ILogger> LoggerMock { get; } = new();

    public CalculatorMockManager WithParsedTokens(List<CalculatorToken> tokens)
    {
        TokenParserMock
            .Setup(m => m.Parse(It.IsAny<string>()))
            .Returns(tokens);
        return this;
    }

    public CalculatorMockManager WithOperation(OperationType operationType, int v1, int v2, int result)
    {
        var operationMock = new Mock<IOperation>();
        operationMock
            .Setup(m => m.Execute(v1, v2))
            .Returns(result);

        MathOperationFactoryMock
            .Setup(m => m.GetOperation(operationType))
            .Returns(operationMock.Object);

        operationMocks[operationType] = operationMock;

        return this;
    }

    public Calculator GetService()
    {
        return new Calculator(
                TokenParserMock.Object,
                MathOperationFactoryMock.Object,
                CacheMock.Object,
                LoggerMock.Object
            );
    }

    public CalculatorMockManager VerifyOperationExecute(OperationType operationType, int v1, int v2, Func<Times> times)
    {
        MathOperationFactoryMock
            .Verify(m => m.GetOperation(operationType), Times.AtLeastOnce);
        var operationMock = operationMocks[operationType];
        operationMock
            .Verify(m => m.Execute(v1, v2), times);
        return this;
    }
}

All of the required mocks for the test class are declared as public properties, allowing any customization of a unit test. There is a GetService method, which will always return an instance of the tested class, with all of the dependencies fully mocked. Then there are With* methods which atomically set up various scenarios and always return the mock manager, so that they can be chained. You can also have specific methods for assertion, although in most cases you will be comparing some output with an expected value, so these are here just to abstract away the Verify method of the Moq framework.

A MockManager base class

Mock managers are very useful and make for readable code and nice tests, but they can be tiresome to write. When you want to test hundreds of classes, writing a mock manager for all becomes really annoying. Luckily, you can use a base class that makes this really easy!

So let's rewrite the CalculatorMockManager class with this base class:

public class CalculatorMockManager
    : MockManagerBase<Calculator>
{
    private readonly Dictionary<OperationType, Mock<IOperation>> operationMocks = [];

    public CalculatorMockManager WithParsedTokens(List<CalculatorToken> tokens)
    {
        GetMock<ITokenParser>()
            .Setup(m => m.Parse(It.IsAny<string>()))
            .Returns(tokens);
        return this;
    }

    public CalculatorMockManager WithOperation(OperationType operationType, int v1, int v2, int result)
    {
        var operationMock = new Mock<IOperation>();
        operationMock
            .Setup(m => m.Execute(v1, v2))
            .Returns(result);

        GetMock<IMathOperationFactory>()
            .Setup(m => m.GetOperation(operationType))
            .Returns(operationMock.Object);

        operationMocks[operationType] = operationMock;

        return this;
    }

    public CalculatorMockManager VerifyOperationExecute(OperationType operationType, int v1, int v2, Func<Times> times)
    {
        GetMock<IMathOperationFactory>()
            .Verify(m => m.GetOperation(operationType), Times.AtLeastOnce);
        var operationMock = operationMocks[operationType];
        operationMock
            .Verify(m => m.Execute(v1, v2), times);
        return this;
    }
}

The first thing we notice is that the base class is a generic one, taking as a generic parameter the type of the class we want to test. Then there are no more properties for mocks, the methods setting up mocks use a GetMock<T> method instead. And finally there is no GetService method.

How does it work? Well, when GetService is called, using reflection we find the constructor parameters and find a value to use in them. By default a mock will be generated for each, which then can be accessed with the GetMock<T> method. However, there are two methods that are virtual, allowing to customize the resolution of either the constructor parameter itself or that of its mock. Moreover, you can just decorate a property of the mock manager class with an attribute, and that property value will be used as the constructor parameter of that type.

And if you really liked the idea of Mock properties, then you can define them as read only properties that call GetMock. Here are the base class and the attribute used to decorate properties as constructor parameter providers:

/// <summary>
/// Base class for mock managers
/// </summary>
/// <typeparam name="TSubject"></typeparam>
public abstract class MockManagerBase<TSubject>
    where TSubject : class
{
    protected readonly Dictionary<Type, Mock> mocks = [];
    private TSubject _service;
    private Dictionary<Type, PropertyInfo> _properties;

    public TSubject GetService()
    {
        if (_service is not null) return _service;
        var subjectType = typeof(TSubject);
        var ctors = subjectType.GetConstructors();

        //Currently supports only services with 1 ctor
        var theCtor = ctors.Single();

        var services = new ServiceCollection();
        foreach (var serviceType in theCtor.GetParameters())
        {
            var paramType = serviceType.ParameterType;
            object paramInstance = CreateInstance(paramType);
            services.AddSingleton(paramType, paramInstance);
        }

        var serviceProvider = services.BuildServiceProvider();
        _service = ActivatorUtilities.GetServiceOrCreateInstance<TSubject>(serviceProvider);
        return _service;
    }

    /// <summary>
    /// Override this to have custom values for constructor parameters
    /// </summary>
    /// <param name="type"></param>
    /// <returns></returns>
    protected virtual object CreateInstance(Type type)
    {
        var instance = GetFromProperty(type);
        if (instance is null)
        {
            Mock mock = CreateMock(type);
            mocks[type] = mock;
            instance = mock.Object;
        }
        return instance;
    }

    /// <summary>
    /// Override this to have custom Mocks for constructor parameters
    /// </summary>
    /// <param name="type"></param>
    /// <returns></returns>
    protected virtual Mock CreateMock(Type type)
    {
        var mockType = typeof(Mock<>).MakeGenericType(type);
        var mock = GetFromProperty(mockType) ?? Activator.CreateInstance(mockType);
        return mock as Mock;
    }

    private object GetFromProperty(Type type)
    {
        _properties ??= this.GetType()
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(prop => prop.GetCustomAttribute<ConstructorParameterProviderAttribute>() is not null)
            .ToDictionary(prop => prop.PropertyType, prop => prop);
        if (!_properties.TryGetValue(type, out PropertyInfo prop)) return null;
        return prop.GetValue(this);
    }

    /// <summary>
    /// Get the mock for type <typeparamref name="T"/>
    /// </summary>
    /// <typeparam name="T"></typeparam>
    /// <returns></returns>
    public Mock<T> GetMock<T>()
        where T : class
    {
        GetService(); // ensure mocks are created
        return mocks[typeof(T)] as Mock<T>;
    }
}

/// <summary>
/// Mark a property in a <see cref="MockManagerBase{TSubject}"/> as a provider 
/// for a type or the mock of a type used in constructor injection
/// </summary>
[AttributeUsage(AttributeTargets.Property, AllowMultiple = false, Inherited = true)]
public class ConstructorParameterProviderAttribute : Attribute {}

The test methods should remain unchanged, unless you need a Mock instance outside the mock manager methods, and then you use GetMock.

Here is a more complex mock manager:

public class ExampleMockManager
    : MockManagerBase<Example>
{
    [ConstructorParameterProvider]
    public ExampleDbContext Db { get; } = new MockDbContext();

    protected override object CreateInstance(Type type)
    {
        if (type == typeof(string))
        {
            return DateOnly.FromDateTime(DateTime.Now).ToString("o");
        }
        return base.CreateInstance(type);
    }

    protected override Mock CreateMock(Type type)
    {
        if (type == typeof(IFancyService))
        {
            var mock = CreateFancyMock<IFancyService>();
            return mock;
        }
        return base.CreateMock(type);
    }
}

public class Example
{
    public Example(
        string name, 
        IFancyService fancyService,
        ExampleDbContext dbContext,
        INormalService normalService)
    {
        // ...
    }
}

In this mock manager, the Db property is used to populate the ExampleDbContext constructor parameter, an override of CreateInstance will generate the string as the date of today and an override of CreateMock will create a different type of mock, but just for IFancyService, INormalService gets the regular mock. This is a contrived example, as the following rewrite does exactly the same thing in less code that is much more readable:

public class ExampleMockManager
    : MockManagerBase<Example>
{
    [ConstructorParameterProvider]
    public ExampleDbContext Db { get; } = new MockDbContext();

    [ConstructorParameterProvider]
    public Mock<IFancyService> FancyMock { get; } = new();

    [ConstructorParameterProvider]
    public string Today => DateOnly.FromDateTime(DateTime.Now).ToString("o");
}

Conclusion

This pattern now aligns test writing with code writing:

abstract the things you don't care about in any context
write once and use many times
humanly readable, self documenting code
small methods with low cyclomatic complexity
intuitive code writing

Writing a unit test now is trivial and consistent:

instantiate the mock manager of the class you want to test (or write one based on the steps above)
compose specific scenarios for the test (with auto complete for existing already covered scenario steps)
execute the method you want to test with test parameters
check everything is as expected

The abstraction doesn't stop at the mocking framework. The same pattern can be applied in every programming language! The mock manager construct will be very different for TypeScript or JavaScript or something else, but the unit test would pretty much look the same way.

Hope this helps!

Migrate data from an SQL Server table to another, without locking it

Published Nov 8, 2024

and has 0 comments

Here is the scenario:

you have two tables: a source of recent data and an SQL Server destination table with the same structure that has to be updated with that data
they are both on the same server (we assume that if the data source is external you have copied it into a local temporary table)
the destination table is in active use
the source destination may be in active use also
the tables are large, locking them and waiting for the merge to finish is not an option
the difference between the tables is usually small. Not so small as to be irrelevant, but small relative to the size of the tables
we don't care about full data consistency (any intermediate state between the current and the updated one is acceptable)

There are some quirks in SQL Server that affect our task, mainly that if you try to modify 5000 rows or more of a table, then the locking will escalate from row locks to a table lock, turning it unusable during the merge. Also, for operations involving a lot of data, the log of the database will increase to accommodate that size, then the changes will be persisted and the log invalidated. This takes a lot of disk space and if anything happens during the operation, the entire operation will be rolled back, with all of the data in the log having to be restored, which also takes a lot of time, blocking the database.

The first idea is to find rows that are different, new or missing and just sync 5000 of them at a time. One can do this with a MERGE or classic INSERT, UPDATE, DELETE operations. And this works, but it has a major flaw: after the first 5000 rows have been found and synchronized, the next operation will go through them again anyway to find the next 5000 rows, and again, and again, and again. The execution time will increase exponentially with every new batch of rows.

What we actually need is to take a list of changes and apply them in batches, kind of like an artificial transaction log. The following stored procedure will take two parameters: the full schema name of the source table and the full schema name of the destination table. It will create a log of changes, take 4900 of them at a time and apply them to the destination table. The only restrictions are that the destination table has to have primary key columns and the source and destination tables have the same columns. For performance reasons, it's best that the source table also has the same primary keys or at least an index on the same columns. The usage would look like EXEC usp_MergeTables 'SomeSchemaIncludingDbo.SourceTable','MaybeSomeOtherSchema.DestinationTable'

I will explain what it does after the code:

-- this is for batched merge (del, upd, ins) of the data of a table into another (nolock/readpast for reduced locking of source and destination tables)

CREATE OR ALTER PROC usp_MergeTables(@SourceTable NVARCHAR(256),@DestinationTable NVARCHAR(256))
AS
BEGIN

SET NOCOUNT ON

IF NOT EXISTS(SELECT * 
FROM sys.schemas s
INNER JOIN sys.tables t
ON s.schema_id=t.schema_id
WHERE REPLACE(REPLACE(@SourceTable,']',''),'[','')=s.name+'.'+t.name)
BEGIN

  DECLARE @Err1 VARCHAR(100) = 'Source table '+@SourceTable+' not found!'
  ;THROW 50404,@Err1,1

END

SELECT CAST(s.name as NVARCHAR(Max)) as schemaName,CAST(t.name as NVARCHAR(Max)) as tableName,CAST(c.name as NVARCHAR(Max)) as columnName,c.is_computed,c.is_identity, 
CASE 
  WHEN tp.name IN ( 'varchar', 'char', 'varbinary' ) THEN tp.name + 
                  CASE WHEN c.max_length = -1 THEN '(max)'       
                         ELSE '(' + CAST(c.max_length AS VARCHAR(4)) + ')' END       
  --types that have an unicode character type that requires length to be halved 
  WHEN tp.name IN ( 'nvarchar', 'nchar' ) THEN tp.name + 
                   CASE WHEN c.max_length = -1 THEN '(max)'       
                         ELSE '(' + CAST(c.max_length / 2 AS VARCHAR(4)) + ')'       
                                                              END
   --types with a datetime precision 
  WHEN tp.name IN ( 'time', 'datetime2', 'datetimeoffset' ) THEN tp.name + 
                                            '(' + CAST(c.scale AS VARCHAR(4)) + ')' 
  --types with a precision/scale 
  WHEN tp.name IN ( 'numeric', 'decimal' ) 
  THEN tp.name + '(' + CAST(c.precision AS VARCHAR(4)) + ',' +
                                             CAST(c.scale AS VARCHAR(4)) + ')'
  --timestamp should be reported as rowversion 
  WHEN tp.name = 'timestamp' THEN 'rowversion' 
  --and the rest. Note, float is declared with a bit length, but is 
  --represented as either float or real in types  
  ELSE tp.name 
END as typeName,
ISNULL(( SELECT pk.is_primary_key FROM sys.indexes pk
INNER JOIN sys.index_columns ic
ON ic.object_id = pk.object_id
AND ic.index_id = pk.index_id
AND c.column_id=ic.column_id
WHERE t.object_id = pk.object_id
AND pk.is_primary_key = 1
),0) as is_primary_key
INTO #tgtColumns
FROM sys.schemas s
INNER JOIN sys.tables t
ON s.schema_id=t.schema_id
INNER JOIN sys.all_columns c
ON t.object_id=c.object_id
INNER JOIN sys.types tp
ON c.system_type_id=tp.system_type_id
AND c.user_type_id=tp.user_type_id
WHERE REPLACE(REPLACE(@DestinationTable,']',''),'[','')=s.name+'.'+t.name

IF NOT EXISTS(SELECT * FROM #tgtColumns)
BEGIN

  DECLARE @Err2 VARCHAR(100) = 'Destination table '+@DestinationTable+' not found!'
  ;THROW 50404,@Err2,2

END

IF NOT EXISTS(SELECT * FROM #tgtColumns WHERE is_primary_key=1)
BEGIN

  DECLARE @Err3 VARCHAR(100) = 'Destination table '+@DestinationTable+' has no primary keys!'
  ;THROW 50404,@Err3,3

END

DECLARE @operSql NVARCHAR(Max)
DECLARE @delSql NVARCHAR(Max)
DECLARE @updSql NVARCHAR(Max)
DECLARE @insSql NVARCHAR(Max)
DECLARE @identityInsertOn NVARCHAR(Max)=''
DECLARE @identityInsertOff NVARCHAR(Max)=''

IF EXISTS(SELECT * FROM #tgtColumns WHERE is_identity=1)
BEGIN

  SET @identityInsertOn=CONCAT(N'
SET IDENTITY_INSERT ',@DestinationTable,N' ON
')
  SET @identityInsertOff=CONCAT(N'
SET IDENTITY_INSERT ',@DestinationTable,N' OFF
')

END


SELECT @operSql = CONCAT(N'DROP TABLE IF EXISTS #oper

SELECT *
   INTO #oper
FROM (
SELECT ',STRING_AGG(CONCAT(N'ISNULL(src.[',c.columnName,N'],tgt.[',c.columnName,N']) as [',c.columnName,N']'),N', '),N',
  CASE
    WHEN ',STRING_AGG(CONCAT(N'src.[',c.columnName,N'] IS NULL'),N' AND '),N' THEN ''DEL''
    WHEN ',STRING_AGG(CONCAT(N'tgt.[',c.columnName,N'] IS NULL'),N' AND '),N' THEN ''INS''
    WHEN (
    	SELECT * FROM ',@SourceTable,N' R
    	WHERE ',STRING_AGG(CONCAT('R.[',c.columnName,N'] = src.[',c.columnName,N']'),N' AND '),N'
    	FOR XML PATH(''Row''), ELEMENTS XSINIL
      ) <> (
    	SELECT * FROM ',@DestinationTable,N' R
    	WHERE ',STRING_AGG(CONCAT('R.[',c.columnName,N'] = tgt.[',c.columnName,N']'),N' AND '),N'
    	FOR XML PATH(''Row''), ELEMENTS XSINIL
      ) THEN ''UPD''
   END as __oper__
   FROM ',@SourceTable,N' src (NOLOCK)
   FULL OUTER JOIN ',@DestinationTable,N' tgt (NOLOCK)
   ON ',STRING_AGG(CONCAT('src.[',c.columnName,N'] = tgt.[',c.columnName,N']'),N' AND '),'
) x
WHERE __oper__ IS NOT NULL

CREATE INDEX temp_id ON #oper(',STRING_AGG(CONCAT('[',c.columnName,N']'),N', '),N')
CREATE INDEX temp_oper ON #oper(__oper__)')
FROM #tgtColumns c
WHERE c.is_primary_key=1

SELECT @delSql = CONCAT(N'

DECLARE @batch TABLE(',STRING_AGG(CONCAT('[',c.columnName,N'] ',c.typeName),N', '),N',
                   PRIMARY KEY(',STRING_AGG(CONCAT('[',c.columnName,N']'),N', '),N'))

DECLARE @ROWS INT = 1

WHILE (@ROWS>0)
BEGIN

  DELETE TOP (4900) tgt
    OUTPUT ',STRING_AGG(CONCAT('deleted.[',c.columnName,N']'),N', '),N'
	INTO @batch
  FROM ',@DestinationTable,N' tgt (READPAST)
  INNER JOIN #oper src
    ON ',STRING_AGG(CONCAT('src.[',c.columnName,N'] = tgt.[',c.columnName,N']'),N' AND '),N'
  WHERE src.__oper__=''DEL''

  SET @ROWS=@@ROWCOUNT

  DELETE o
  FROM #oper o
  INNER JOIN  @batch b
  ON ',STRING_AGG(CONCAT('o.[',c.columnName,N'] = b.[',c.columnName,N']'),N' AND '),N'
  
  DELETE FROM @batch

  IF (@ROWS=0)
    SELECT @ROWS=COUNT(*) FROM #oper WHERE __oper__=''DEL''

END')
FROM #tgtColumns c
WHERE c.is_primary_key=1


SELECT @updSql = CONCAT(N'

SET @ROWS = 1

WHILE (@ROWS>0)
BEGIN

  UPDATE tgt
  SET ',(SELECT STRING_AGG(CONCAT('[',c.columnName,N'] = src.[',c.columnName,N']'),N', ')
FROM #tgtColumns c
WHERE c.is_primary_key=0 AND c.is_computed=0),N' OUTPUT ',STRING_AGG(CONCAT('inserted.[',c.columnName,N']'),N', '),N'
	INTO @batch
  FROM ',@DestinationTable,N' tgt (READPAST)
  INNER JOIN ( 
    SELECT TOP (4900) s.*
    FROM #oper o
	INNER JOIN ',@SourceTable,N' s
	ON ',STRING_AGG(CONCAT('s.[',c.columnName,N'] = o.[',c.columnName,N']'),N' AND '),N'
    WHERE __oper__=''UPD''
  ) src
    ON ',STRING_AGG(CONCAT('src.[',c.columnName,N'] = tgt.[',c.columnName,N']'),N' AND '),N'

  SET @ROWS=@@ROWCOUNT

  DELETE o
  FROM #oper o
  INNER JOIN  @batch b
  ON ',STRING_AGG(CONCAT('o.[',c.columnName,N'] = b.[',c.columnName,N']'),N' AND '),N'
  
  DELETE FROM @batch

  IF (@ROWS=0)
    SELECT @ROWS=COUNT(*) FROM #oper WHERE __oper__=''UPD''

END')
FROM #tgtColumns c
WHERE c.is_primary_key=1


SELECT @insSql = CONCAT(N'

SET @ROWS = 1

WHILE (@ROWS>0)
BEGIN

  INSERT INTO ',@DestinationTable,N'(',(SELECT STRING_AGG(CONCAT('[',c.columnName,N']'),N', ')
FROM #tgtColumns c
WHERE c.is_computed=0),N') OUTPUT ',STRING_AGG(CONCAT('inserted.[',c.columnName,N']'),N', '),N'
    INTO @batch
  SELECT TOP (4900) ',(SELECT STRING_AGG(CONCAT('s.[',c.columnName,N']'),N', ')
FROM #tgtColumns c
WHERE c.is_computed=0),N'
  FROM #oper o
  INNER JOIN ',@SourceTable,N' s
  ON ',STRING_AGG(CONCAT('s.[',c.columnName,N'] = o.[',c.columnName,N']'),N' AND '),N'
  WHERE __oper__=''INS''

    SET @ROWS=@@ROWCOUNT
	
  DELETE o
  FROM #oper o
  INNER JOIN  @batch b
  ON ',STRING_AGG(CONCAT('o.[',c.columnName,N'] = b.[',c.columnName,N']'),N' AND '),N'
  
  DELETE FROM @batch

  IF (@ROWS=0)
    SELECT @ROWS=COUNT(*) FROM #oper WHERE __oper__=''INS''

END

DROP TABLE #oper
')
FROM #tgtColumns c
WHERE c.is_primary_key=1

DROP TABLE #tgtColumns

--PRINT @operSql
--PRINT @delSql
--PRINT @updSql
--PRINT @identityInsertOn
--PRINT @identityInsertOff
--PRINT @insSql

DECLARE @sql NVARCHAR(Max) = CONCAT(@operSql, @delSql, @identityInserton, @updSql, @insSql, @identityInsertOff)
EXEC sp_sqlexec @sql

END

OK, this is a large thing, but the principles used are simple:

first we create a table containing information about the columns of the tables: schema, table, column name, whether the column is primary key or computed, and the type name. The schema and table name are not used, but may be useful for debugging. Note that this SP doesn't check the tables have the same number and type of columns. That's on you to ensure.
using the column information we create four strings that will contain the SQL for the following operations:
- create an "operations table"
- delete rows that are not needed
- update rows that need to be updated
- insert rows that are missing
there are four strings mostly for debugging purposes to keep them smaller than the 8000 characters that Microsoft SQL Server Management Studio can print at a time, but they are concatenated and executed as one.
implementation details:
- we use FOR XML PATH('Row'), ELEMENTS XSINIL to generate a string with all the data in each row, so we don't have to compare rows column by column. We could have made this work with comparisons, but the code would have been bulky and ugly when comparing for NULL or for having values. ELEMENTS XSINIL will ensure that there is a difference between empty space and NULL.
- a FULL OUTER JOIN is used to find (based on the primary key columns of the destination table) if rows need to be either deleted, updated or inserted. That operation is specified in the __oper__ column
- the operations table is thus created, containing the primary key values and the operation required with two indexes: one on the primary keys and one on the operation. These indexes are not really that relevant, so one could choose to remove them.
- a @batch table variable is used with a PRIMARY KEY on the primary key columns of the destination table, for performance reasons
- the batching is done via a DELETE... OUTPUT operation. We delete the 4900 rows we process and we output them in the @batch table
- for each segment we either: delete destination rows with the same primary keys for 'DEL', update destination rows with the same primary keys for 'UPD' and insert source rows with the same primary keys for 'INS'
- before we update, we set IDENTITY_INSERT to ON and at the end to OFF, if any identity columns
- if some rows were affected by the operation, then we continue in the same segment. If not, then we look in the operations table to see if there are still rows to be processed. No rows may be affected while there are still rows to be processed due to the locking avoidance hints
improvements to avoid locking:
- when we generate the operations table we use NOLOCK, which reads uncommitted values ignoring locks. This may not be what you want, but if the source table is locked for whatever reason, this ensures the merge operation is not blocked
- when we process the batches we use READPAST, which ignores locked destination rows. This ensures that rows that can be updated will be, while the others can be done later, meaning that if you have 1 locked row, the merge operation will go around it, then wait until it is unlocked to update or delete it.

If you want to see what the generated SQL looks like, uncomment the PRINT lines and comment the EXEC one.

Now, I just wrote this stored procedure, so I may have missed some cases. Let me know if you find a situation where this doesn't work as expected.

Hope it helps!

SELECT INTO quirk I had no idea about

Published Nov 8, 2024

and has 0 comments

What is the structure of a table created from another via SELECT * INTO [Second] FROM [First] ?

A simple question, indeed, a basic one, but one that I have never asked myself until today. I honestly believed that the result is a generic table containing the same column names and types and their values and nothing else.

The answer, though, is strange. If the original column is NOT NULL, the resulting column will also be NOT NULL. If the original column has IDENTITY, the column in the second table will also have IDENTITY. And, as you know, you can't add or remove IDENTITY from existing columns without dropping and adding them back. The DEFAULT value, though, is not transferred.

The Microsoft reference page says:

The format of new_table is determined by evaluating the expressions in the select list. The columns in new_table are created in the order specified by the select list. Each column in new_table has the same name, data type, nullability, and value as the corresponding expression in the select list.

...

When an existing identity column is selected into a new table, the new column inherits the IDENTITY property, unless one of the following conditions is true:

The SELECT statement contains a join.

Multiple SELECT statements are joined by using UNION.

The identity column is listed more than one time in the select list.

The identity column is part of an expression.

The identity column is from a remote data source.

...

Indexes, constraints, and triggers defined in the source table are not transferred to the new table, nor can they be specified in the SELECT...INTO statement. If these objects are required, you can create them after executing the SELECT...INTO statement.

...

Specifying an ORDER BY clause does not guarantee the rows are inserted in the specified order.

...

When a computed column is included in the select list, the corresponding column in the new table is not a computed column. The values in the new column are the values that were computed at the time SELECT...INTO was executed.

So, my assumptions were wrong, but still what do you do when you want to achieve exactly that: create a new schema-less table with the same columns in terms of name and data type and value only? It's a bit ugly, but I have not found a better alternative. Just UNION ALL with a SELECT that has the same number of (nullable) columns and no rows. like this:

SELECT *
  INTO [Second]
FROM [First]
  UNION ALL
(SELECT NULL, NULL,NULL,NULL WHERE 1=0)

Assuming that the First table had four columns, this will result in a table having the same column names, types and values as the original table, but all columns will be nullable and with no identity.

Hope it helps!

Firefox and content_scripts.world

Published Sep 13, 2024

Posted in
programming
browser

and has 1 comment

I have been maintaining a Chrome browser extension I wrote for more than a year now and I always get the occasional user asking me if I can make it work with Firefox. And until now I said no, because I use the "world" feature in manifest.json. But I was wrong.

You see, Google is aggressively pushing for manifest version 3, discontinuing support for browser extensions using v2. So my extension was written from the beginning with version 3. In order to load the scripts and CSS files that I needed to run in the context of the page, I used the world:"MAIN" feature. When I tried it with v2, Chrome immediately told me "The 'world' property is restricted to extensions with 'manifest_version' set to 3 or higher." (Or higher. lol) So when I looked for how to use world in Firefox manifest v3 I got a lot of documentation about Xray Vision and how they absolutely refuse to implement the world feature.

I googled, I scoured the web, I read all of Stack Overflow, I even tried LLMs like ChatGPT of Gemini to hilarious results, like Gemini accusing me I want to circumvent the security of the browser. No. There is no simple way of doing the same thing in Manifest v3 on Firefox. Case closed, right? I mean, who even uses Firefox? Less than 2.5% of people on the Internet. And it's a shit browser.

But no, here comes a guy and tells me that he made it work. HOW?! By now you probably guessed it, I left enough hints after all. You just have to use manifest version 2! 'world' is only restricted for version 3 in Chrome browsers!!!

Although now that I am writing this post I see stuff like Manifest V3 updates landed in Firefox 128 so maybe it something that works only recently (the article is from July 2024). Hmm..

I guess there is a silver lining to the fact I refused users for a year now, because I had a year in which I didn't have to deal with the buggy debugger (heh!) in Firefox.

Anyway, if you have the same problem, that's your solution: make a v2 manifest for Firefox and a v3 manifest for Chromium browsers. Or use it in v3 as well, because apparently Firefox changed their minds. And I wrote this blog with such glee that I would help people that had the same problem as me. Ugh!

NO! I will not make it work for Safari! Eat a dick!

Huge SQL performance gains from using INTs

Published Sep 3, 2024

Posted in
programming
sql
picture

and has 0 comments

Just a few days ago I was writing on how important it is to tell Entity Framework what SQL type to use in order to avoid costly conversions. In fact, it wasn't so much an EF issue as it was an SQL one. Converting even character types to character types or changing collation was surprisingly expensive. In this post I will show you how important it is to choose the right type for your querying columns, especially the primary key.

First imagine this scenario: you get some data from an outside source, rows and rows of it, and you have to store them and query them in SQL. The value that uniquely identifies a row is a small string, maybe 50 characters long. How do you proceed?

My first naive solution was the most obvious one: just create a table that has a column for each value in the rows and put the primary key on the identifying one. But this leads to immediate performance losses:

by default, a primary key is a clustered index - text is not sequential, so at every insert the database engine will physically move huge swaths of data in order to place the rows in the alphabetical order of their identifiers
a primary key is a unique index - meaning text will have to get compared to other text in order to determine uniqueness, which is slow
by default, SQL is case insensitive - meaning that all text comparisons will have to be made taking into account capitalization and accents
50 characters is a lot - even without Unicode support, it's 50 bytes, which is 12 times more than an integer, meaning the primary key index will be large; and slow

"But!", you will undoubtedly say, if you put the primary key on some other column, you will still have to create a unique index on the identifier. Isn't this just pushing the problem farther down the road? The size and speed limitations will be the same. And primary keys are clustered only by default, but they can be declared as not clustered. And SQL doesn't need to be case insensitive, all you have to do is change the collation of the column to be binary and it will be compared faster. Wouldn't that solve the problem?

No. In fact, my final solution which worked five times faster, did not have an index on the identifier column AT ALL. Incidentally, I did end up changing the collation, but only because the idiots sending me the data were doing it case sensitive.

Without further ado, here is what I did:

an INT column with IDENTITY(1,1) as the primary key - which ensures a fast insertion due to the sequential nature of the value, fast query speed and low usage of disk space for the index
an INT column holding the checksum of the identifier - which when indexed, is fast to query and doesn't use a lot of disk space for the index

So how do I query on the identifier? Simple: I calculate the checksum of the string and then I look it up in the database - which uses the index to locate the few strings that have the same checksum, then just finds the right one by enumerating through them. I query on the checksum column AND the text identifier. And there is an added bonus: I only need to do this once. If I need the record from the DB again, I query it directly through the integer primary key.

Entity Framework has this automatic memory cache so when I am querying on the database entity using a business model - as good separation of concerns practice would dictate - it gets it really fast from memory. Because the memory cache also uses just the int to identify an entity, which means double the benefits!

The eagle eyed reader will have noticed that I am not using a unique index on the identifier, so technically I could create multiple rows with the same one. However, my application is always looking for the existing record first. But if you really worry about data consistency, the index on the checksum column can be replaced with a unique index on the checksum and identifier column. It will take more space, but it will be just as fast.

Another thing that you may have noticed is that I use a code checksum, not the database provided functions to achieve the same. At first glance, it's an instant win: just create a persisted computed column that calculates the checksum or binary checksum of the identifier column. However, this would be weird when having to query, since you would have to craft a stored procedure or a custom SQL command to get the identifier and query on its checksum. In my case I just calculate a checksum - and not use the lazy string.GethashCode function which may be subject to change and it's already different between 32 and 64 bit systems.

Of course, if you want your text columns to be case and/or accent insensitive, you will have to store the hash code of the lowercase and unaccented string or use an implementation that is case and accent insensitive. This may not be trivial.

Further tests showed that just using a non clustered index on the identifier column, even a unique one, was just slightly slower, maybe 5%. However, the space taken by indexes increased by 20%. So I might understand why you would find it a bit off putting and skip the checksum part.

Hope this helps!

P.S. Why did this solution provide such a huge performance gain? Obviously the SQL team would have implemented a sort of checksum for their text index, this should have been working natively and faster than any possible implementation I could make. Well, I don't know the answer. In fact, this all could be some quirk of Entity Framework and the SQL queries would not be optimizable to such a degree. I will attempt to test that using purely SQL commands. But meanwhile, all the points I made above are valid and with a little more work you can have a lot more control on how the system works.

EF Core 6 - correct types halving the execution time!

Published Aug 30, 2024

and has 0 comments

I've built an application and, like any lazy dev out there, I focused on the business logic, the project structure, the readability, comments, the dependency injection, the unit tests, you know... the code. My preference is to start from top to bottom, so I create more and more detailed implementations of interfaces while going down to the metal. The bottom of this chain is the repository, that class which handles database access, and I've spent little to understand or optimize that code. I mean, it's DB access, you read or you write stuff, how difficult can it be?

When it was time to actually test it, the performance of the application was unexpectedly bad. I profiled it and I was getting reasonable percentages for different types of code, but it was all taking too long. And suddenly my colleague says "well, I tried a few things and now it works twice as fast". Excuse me?! You did WHAT?! I have been trying a few things too, and managed to do diddly squat! Give me that PR to see what you did! And... it was nothing I could see.

He didn't change the code, he just added or altered the attributes decorating the properties of models. That pissed me off, because I had previously gone to the generated SQL with the SQL Profiler and it was all OK. So I executed my code and his code and recorded the SQL that came out:

was it the lazy loading? Nope. The number of instructions and their order was exactly the same
was it the explicit declaration of the names of indexes and foreign keys? Nope. Removing those didn't affect performance.
was it the ChangeTracker.LazyLoadingEnabled=false thing? Nope, I wasn't using child entities in a way that could be affected.
was there some other structure of the generated SQL? No. It was exactly the same SQL! Just my code was using thousands of CPU units and his was using none.
was it magic? Probably, because it made no sense whatsoever! Except...

Entity Framework generates simple SQL queries, but it doesn't execute them as you and I would. It constructs a string, then uses sp_executesql to run it. Something like this:

exec sp_executesql N'SELECT TOP(1) [p].[ID], [p].[TXT], [p].[LUP_TS]

FROM [sch].[table] AS [p]

WHERE [p].[ID] = @__p_0',N'@__p_0 nvarchar(64)',@__p_0='xxxx'

Do you see it? I didn't until I started to compare the same SQL in the two versions. And it was the type of the parameters! Note that the aptly named parameter @__p_0 is an NVARCHAR. The actual column in the database was VARCHAR! Meaning that the code above was unnecessarily always converting values in order to compare them. The waste of resources was staggering!

How do you declare the exact database type of your columns? Multiple ways. In my case there were three different problems:

no Unicode(false) attribute on the string columns - meaning EF expected the columns to be NVARCHAR
no Typename parameter in the Column attribute where the columns were NTEXT - meaning EF expected them to be NVARCHAR(Max)
- I guess one could skip the Unicode thing and instead just specify the type name, but I haven't tested it
using MaxLength instead of StringLength - because even if their descriptions are very similar and MaxLength sounds like applying in more cases, it's StringLength that EF wants.

From 40-50ms per processing loop, it dropped to 21ms just by fixing these.

Long story short: parametrized SQL executed with sp_executesql hides a possible performance issue if the columns that you compare or extract have slightly different types than the one of the parameters.

Go figure. I hate Entity Framework!

EF Core 6 - This SqlTransaction has completed; it is no longer usable.

Published Aug 20, 2024

and has 0 comments

Intro

This post is about the System.InvalidOperationException: This SqlTransaction has completed; it is no longer usable. which may be because you shared your SqlConnection or you tried to SaveChanges twice and all of the other issues that you can google for. I was not so lucky. I spent a day and a half to understand what's going on and only with a help of another dev did I get close to the issue.

TL;DR;

I used a column with identity generation, but it wasn't also a primary key and EF sucks.

Details

Imagine my scenario first: I wanted to use a database to assign a unique integer to a string. I was first searching for the entry in the DB and, if not found, I would just insert a new one. The SQL Server IDENTITY(1,1) setting would insure I got a new unique value for the inserted row. So the table would look like this:

CREATE TABLE STR_ID (
  STR NVARCHAR(64) PRIMARY KEY,
  ID INT IDENTITY(1,1)
}

Nothing fancy about this. Now for the C# part, using Entity Framework Core 6.

I created an entity class for it:

[Table("STR_ID")]
public class StrId {

  [Column("STR")]
  [Key]
  public string Text { get; set; }

  [Column("ID")]
  [DatabaseGenerated(DatabaseGeneratedOption.Identity)]
  public int Id { get; set; }

}

And then I proceeded to test it in the following way:

create a DbContext instance
search for a value by STR/Text in the proper DbSet
if it doesn't exist, insert a new row and SaveChanges
retrieve the generated id
dispose the context

I also ran this 20 times in parallel (well, as Tasks - a minor distinction, but it was using the thread pool).

The result was underwhelming. It would fail EVERY TIME, with either an exception about deadlocks or

System.InvalidOperationException: This SqlTransaction has completed; it is no longer usable.
   at Microsoft.Data.SqlClient.SqlTransaction.ZombieCheck()
   at Microsoft.Data.SqlClient.SqlTransaction.Commit()

I did what every sane developer would do in this situation and bought myself a shotgun (we all know it's the most effective against zombies) then googled for other people having this issue. I mean, it would be common, right? You do some EF stuff in parallel and you get some errors.

No. This is happening in a parallelism scenario, but that's not the cause. Also, it's not about transactions. EF will wrap SaveChanges operations in a transaction and that is causing the error, but the transaction being completed is the issue and no, it's not your code!

I tried everything I could think of. I disabled the EF transaction and made my own, using all types of IsolationLevel, I tried EnableRetryOnFailure with hilarious results (I was monitoring the values inserted in the database with NOLOCK and they were going to 20, then back to 10, then 15, then back to 1 and it was taking ages trying to retry operations that apparently had dependencies to each other, only to almost all to fail after a long time). I even disabled connection pooling, which probably works, but would have made everything slow.

Solution

While I can't say what EXACTLY caused the problem (I would have to look into the Microsoft code and I don't feel like it now), the solution was ridiculously simple: just make the IDENTITY column a primary key instead:

CREATE TABLE STR_ID (
  ID INT PRIMARY KEY IDENTITY(1,1),
  STR NVARCHAR(64)
}

-- because this is what I am searching for
CREATE UNIQUE INDEX IX_STR_ID_STR ON STR_ID(STR)

[Table("STR_ID")]
public class StrId {

  [Column("ID")]
  [Key]
  public int Id { get; set; }

  [Column("STR")]
  public string Text { get; set; }

}

I was about to use IsolationLevel.ReadUncommitted for the select or just set AutoTransactionsEnabled to false (which also would have solved the problem), when the other guy suggested I would use this solution. And I refused! It was dumb! Why the hell would that work? You dummy! And of course it worked. Why? Donno! The magical thinking in the design of EF strikes again and I am the dummy.

Conclusion

What happened is probably related to deadlocks, more specifically multiple threads trying to read/write/read again from a table and getting in each other's way. It probably has something to do with how IDENTITY columns need to lock the entire table, even if no one reads that row! But what it is certain to be is a bug: the database functionality for a primary key identity column and a unique indexed identity column is identical! And yet Entity Framework handles them very differently.

So, in conclusion:

yay! finally a technical post
this had nothing to do with how DbContexts get disposed (since in my actual scenario I was getting this from dependency injection and so I lost hours ruling that out)
the error about transactions was misleading, since the issue was what closed the transaction inside the Microsoft code not whatever you did
the advice of some of the AggregateExceptions up the stream (An exception has been raised that is likely due to a transient failure. Consider enabling transient error resiliency by adding 'EnableRetryOnFailure' to the 'UseSqlServer' call.) was even more misleading
the EF support for IDENTITY columns - well, it needs it because then how would it know not to attempt to save values in those columns - is also misleading, because it doesn't mean it's good support
while parallel access to the DB made the problem visible, it has little to do with parallelism
EF knows how to handle PRIMARY KEYs so that's the solution
EF sucks!

I really hope this saves time for people in the same situation!

LiChess Tools used for the Lichess Sharjah Masters 2024!

Published May 13, 2024

and has 2 comments

Happy Birthday, LiChess Tools!

It was one year ago that LiChess Tools was first published on GitHub. It was like the birth of a child, having spent a few weeks in gestation as an extension I would only use for myself. It was doing just the simplest of things at the time:

opening friends box automatically on page load, so that you see if you have any friends online
making sound alerts when friends started playing, so you can watch your favorite chess players the moment they start a game
pressing Shift-Right to choose a random variation from current position in analysis or study
sticky Preview mode in studies, so that you can move from chapter to interactive chapter and keep playing without seeing the move list
setting a minimum engine depth, so that the local engine would run automatically if a lower cloud depth was displayed

It had the Preferences in the extension popup, because they were so few features. The code was so awful that I didn't dare tell anyone about it.

Now this has become a behemoth with close to 100 different tools and bright prospects for the future.

I would like to thank the community, such as it is, because even if only one person was giving me feedback in a month, it could have happened when I was feeling low or stressed or unmotivated and it would perk me up immediately. Thank you a lot, guys!

For some weird reason (all passionate devs are weird) there was nothing more motivating than some kid wanting a feature, first thinking it was impossible, then getting the nagging feeling that I should think about it more, then finding a brilliant lateral solution, implementing it, improving on it, then delivering everything within the hour only to get a bored "thanks" at the end. But that thanks was all I needed to carry on. Occasionally I get thankful notes from people and it makes my day.

Right now LiChess Tools has 2500 daily users and 26 ratings that average to 4.8 stars. It's not the quantity, but the quality, though. The extension is focused on chess analysis and ease of learning. It's basically a pro tool, aimed at chess enthusiasts, coaches, schools and chess professionals. With such a scope, 2500 users is huge! And we'll get even higher.

At the time of this writing, plans are in motion to use the OBS integration feature of LiChess Tools for the official Lichess Sharjah Masters broadcast on the 14th of May, presented by WIM Irene Kharisma Sukandar. Oooh, I hope it doesn't break midway 😱

[youtube:kslb3y4W7RM]

But there is more! I am working with the Lichess devs to streamline some of the more hackish features of the extension so that it can be used en masse without bothering the Lichess servers. I've received some suggestions from GMs and IMs and chess coaches that I will implement in LiChess Tools and I will support a plan to update the chess openings list in Lichess (as well as in Lichess Tools).

So there are some great opportunities open to the extension in the near future and hopefully they will make this blossom into something even more special!

The next major version (3.*) will probably restructure the features into more mature tools, focus on performance and adding more "epic" features, like:

full Client Side Analysis - including brilliant/good/best move detection ideas, statistics and charts
a more complete and user friendly Explorer Practice module
Chessable-like interface for Studies and spaced repetition features

There is also time for a rebranding. I am tired of people thinking I am talking about the Tools menu in Lichess. Right now the best idea I have is Prometheus for Lichess. I just hope Thibault is not going to nail me to a mountain and sic the Lichess owl on my liver. Perhaps you guys can come with better ideas.

Rebranding doesn't come with corporate goals and premium tiers, though. LiChess Tools will always be free, regardless of its name, so don't worry.

So, let's celebrate by singing along with the official LiChess Tools theme and hope for an even more awesome year!

It's made with AI, so it's cool by default 😁

Enjoy chess, Lichess and LiChess Tools!

P.S. Bring me your stories, people! I want to know how you use the extension. Join the LiChess Tools users team and share your experience with all of us.

In class members which are functions, this is not defined unless you ask for it

Published Dec 19, 2023

and has 0 comments

You are writing some Javascript code in the browser and you create classes, then you create some methods, which you see have some issue with the "this" keyword by default if you use the standard method declaration so you end up doing something like this:

// instead of myMethod() { ... }
myMethod=()=>{
  const myClass=this;
}

And this (pardon the pun) works here. Imagine my surprise when I did something that seemed identical:

// instead of myMethod() { ... }
myMethod=()=>{
  debugger;
}

And then I tried to see what "this" was in the Chrome Developer Tools. It was "undefined". Why?!

Long story short, I then tried this (heh!) and it worked:

// instead of myMethod() { ... }
myMethod=()=>{
  const myClass=this;
  debugger;
}

The moral of the story is that "this" is not declared unless used in code. Probably a browser optimization. Hope it saves you the ten minutes it took me to understand what was going on, after I've started doubting my entire JavaScript expertise and the entire design of my app.

The Pawn Structure feature in LiChess Tools

Published Nov 28, 2023

Posted in
programming
misc
chess

and has 0 comments

As you know, LiChess Tools is my own Chromium browser extension for lichess.org, adding a lot of stuff to it. Recently, from version 2.2.0 on, I've added a new feature called Show Pawn Structures. This post explains in detail what it is, what it does and how it does it.

Pawn Structures

What is a pawn structure? The configuration of pawns on the chessboard. Because pawns are the least mobile of the chess pieces, the pawn structure is relatively static and thus plays a large role in determining the strategic character of the position. It is a powerful chess concept that is mostly ignored on amateur level and I've met 2000+ rated players who didn't know what that was. Not that I know, either, which is why I got so excited to build this feature because it would further chess understanding and learning. With different structures come medium term plans, so instead of having the opaque engine recommendations of making one move or another, you will have a general idea on where to take the game to.

The above is the chess definition of the concept, though. In order to work with it in an algorithm it has to be clearly defined. The difficulty here lies in the fact that while the pawn structure is "relatively static" its meaning is not. While you will be shown a specific pawn configuration in the context of a named structure, it would be implied that other similar configurations also belong. That similarity being not precise, but something nebulous related to the general ideas and themes that are made possible by the structure.

Feature requirements

The purpose of the feature is to determine the pawn structure of a position in either game analysis, analysis board, studies, TV games and mini-games (the things that appear when you hover on a playing user link or in the Current Games section), then display it, similar to the Show Opening feature. The reasoning here is that one can learn to classify positions and thus know the general plans that apply in the situation.

Technical details

There is a list of pawn structures that LiChess Tools supports. The list is at the end of this post. In order to the structure I created a textual representation of them, that looks something like this: 312100TX 0X0210 2020 XXLXXX XXXXXX XXXX. A bit daunting, but the most important part is the first group of characters: 312100TX.

The idea is that the first characters are the most significant, so similar pawn structures would start with the same letters and digits, even if they diverge later on. Based on the structures detailed by various books on the matter, I've considered that the position of the d-pawn is the most significant, followed by the e and c pawns, then the pawn majority on the White and Black sides, followed by the other pawns: f,b,g,a,h. The final part is doubled or triple pawns, which most of the time is irrelevant.

So let's get back to the complicated string above: 312100TX 0X0210 2020 XXLXXX XXXXXX XXXX (we will assume White board orientation)

312 - my pawn position on the d,e,c files: d5 (3 squares forward), e3 (1 square), c4 (2 squares) - the possible characters for this group are X (missing pawn), 0 (unmoved pawn), 1,2,3 (squares the pawn is forward from its starting position)
100 - their pawn position on the d,e,c files: d6, e7, c7
TX - the majority on the queenside and kingside: T (they) have majority on the queenside, and equality on the kingside - the possible characters for this group are M (me), T (them) or X (neither)
0X0 - my pawn position for the f,b,g files
210 - there pawn position for the f,b,g files
20 - my pawn position for the a,h files
20 - my pawn position for the a,h files
XXLXXX XXXXXX XXXX - similar to above groups, doubled or tripled pawns - X for no, L for yes

Most pawn structures are exclusively about the position of the c,d and e file pawns, therefore the first group of characters will be the most determining one. This may change in the future, perhaps, as better chess players than me will point out issues with this system, but for the moment this is how it works.

Based on this string we can compare the current pawn position to the one of the named pawn structures. There are three options in the Preferences for this feature:

Enabled - if this feature is on or off
Only named structures - will only display the named structures if enabled. If not, then the name will be displayed as the most significant group of characters in the structure representation. On mouseover, the title will show the entire thing as well as the most similar named structure found.
Fuzzy - if this feature is on, then a structure will be considered named if 90% similar to the standard one.

The feature will invert the structure and look for a name there if not found for your own orientation. If a name is found there, "(R)" will be added to the name.

Note that the most named structures are represented by the most significant group only, and only several of them by the first two or three groups of characters. The rest is ignored.

Similarity

Now, how do we compute the similarity? It's a work in progress, but at the moment it works like this:

the number of common characters on the same position from the beginning of the text are counted as 1
the number of other common characters on the same position (so after any position where the characters were different) are counted as 0.8
the number of positions where the current structure has pawns on squares that may allow moving them on the named structure square for that position (so where the value is numerical and smaller than the numerical value of the named structure) are counted as 0.2
the percentage of the resulting sum from the characters counted is returned as the result

Example: Carlsbad vs Slav
21X2X0TM
21XX11TX
11100280 (8 denotes 0.8 and 2 denotes 0.2 here) = 4/8 = 50%

It may be that this is not the best way and it might be changed in the future.

List of structures

A small note on naming: different sources name these differently or don't recognize them as structures at all. I did what I could to give a short identifiable name to each position, but I think sooner or later I will have to restrict the number of names, rather than increase it. We'll see.

So here is the list of pawn structures recognized by LiChess Tools (v2.2.3):

Queen's Isolani - 2XXX1XXT
Italian Isolani - 2XXXX0TX
Hanging Pawns - 2X2X1XTT
Carlsbad - 21X2X0TM
Orthodox Exchange - 21X2X1TM
Rauzer Formation - X22X21XX
Boleslavsky Wall - X221X1TM
Scheveningen - X2011XMX
Dragon - X2010XMX 000001
Maróczy Bind - X2210XMX and X2201XMX
Hedgehog - X2211XMX 000010
Boleslavsky Hole - X2012XMX
d5 Chain - 322120XX
e5 Chain - 230210XX
Panov - 2X321XMT and 2X320XMT
Stonewall - 200210XX 000200
Botvinnik - 122122XX
Slav - 21XX11TX
Caro - 2X1X11MT
Modern Benoni - 32X1X2TM
Symmetric Benoni - 3X21X2XX
Closed Sicilian - 120102XX
Grünfeld - 221X02TX 0X0000
Najdorf - 3X012XMT and X20X22XX and X2022XMX
KID (King's Indian Defense) I - 32X12XXX
KID (King's Indian Defense) II - 322122XX
KID (King's Indian Defense) Complex - 222121XX
Open KID (King Indian's Defense) - X221X0TM
French - 2X0210XM and X3121XMX
3-3 vs 4-2 - XX1X1XMT
Benko - 32X102MX 0000X1 00X0
Closed Ruy Lopez - 321123XX
Lopez - 221122XX
Maróczi SymT (Symmetrical Transformation) - 32X10XXX
Maróczi AsymT (Asymmetrical Transformation) - 3X210XMT
KID II TA (Transformation A)- 3X2122XM 00020X
KID II TB (Transformation B) - 3X2122XM 000X00
d5 Chain TA (Transformation A) - 3X2120XM 00020X
d5 Chain TB (Transformation B) - 3X2120XM 000X00
Lopez Exchange (double pawn exchange on d4) - X2X1XXXM

The links above are also used in LiChess Tools and are mostly from Wikipedia, but also some approximations or just other random sites because there are no links for many of the Flores structures. I would be very happy if someone would help me clean these up.

Hope this explains everything. Enjoy!