As I’m currently thinking about writing another Game Boy emulator (in C# this time), I was curious whether the CPU opcodes could be implemented using async and await, because the internal timing of the opcodes depends heavily on wait times for the data bus. For example, if you want to read a 16-bit integer from memory and you only have an 8-bit data bus, the read operation requires two separate reads, since the memory will only give you one byte per clock.

The first thing I did was to create a simple awaitable class which should later act as the internal clock of the emulator:

using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

class AwaitableSignal
{
    private List<WaitHandle> fWaitHandles = new List<WaitHandle>();
    private List<WaitHandle> fNextHandles = new List<WaitHandle>();


    internal class WaitHandle : INotifyCompletion
    {
        private Action fContinuation;

        public bool IsCompleted { get; private set; } = false;

        public void OnCompleted(Action continuation)
        {
            fContinuation += continuation;
        }

        internal void GetResult()
        {
        }

        public void Complete()
        {
            IsCompleted = true;
            fContinuation?.Invoke();
        }
    }

    internal WaitHandle GetAwaiter()
    {
        var wh = new WaitHandle();
        fWaitHandles.Add(wh);
        return wh;
    }


    public void Signal()
    {
        (fNextHandles, fWaitHandles) = (fWaitHandles, fNextHandles);
            
        foreach (var i in fNextHandles)
            i.Complete();

        fNextHandles.Clear();
    }
}

This simple clock just keeps track of its waiting clients in two lists, which are swapped on every clock tick so they don’t need to be reallocated each time.
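To make the idea concrete, here is a minimal sketch of how the 16-bit read from the introduction could await this clock once per byte. `Bus`, `ReadWordAsync`, `ClockDemo` and the two-byte memory are hypothetical names made up for illustration. The sketch assumes the `AwaitableSignal` class above lives in the same project, and it uses a plain `Task<ushort>` as the return type.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical bus that hands out one byte per clock tick.
class Bus
{
    private readonly AwaitableSignal fClock;
    private readonly byte[] fMemory;

    public Bus(AwaitableSignal clock, byte[] memory)
    {
        fClock = clock;
        fMemory = memory;
    }

    // A 16-bit little-endian read costs two ticks on an 8-bit bus.
    public async Task<ushort> ReadWordAsync(ushort address)
    {
        await fClock;                       // first tick: low byte
        byte lo = fMemory[address];
        await fClock;                       // second tick: high byte
        byte hi = fMemory[address + 1];
        return (ushort)(lo | (hi << 8));
    }
}

static class ClockDemo
{
    static void Main()
    {
        var clock = new AwaitableSignal();
        var bus = new Bus(clock, new byte[] { 0x34, 0x12 });

        Task<ushort> read = bus.ReadWordAsync(0);
        Console.WriteLine(read.IsCompleted);            // False: waiting for the first tick
        clock.Signal();
        Console.WriteLine(read.IsCompleted);            // False: waiting for the second tick
        clock.Signal();
        Console.WriteLine(read.IsCompleted);            // True
        Console.WriteLine(read.Result.ToString("X4"));  // 1234
    }
}
```

The continuation runs synchronously inside `Signal()`, so the read advances exactly one step per tick and no threads are involved.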

Using this awaitable class, my sample CPU code was able to “await” at a rate of 0.8 MHz, which is much too low for a Game Boy, which clocks at roughly 4 MHz, and I was about to give up. But after some searching on the internet I stumbled upon custom task-like types. So what if I remove every multi-threading safety net that the task type might have and reduce async / await to the state machine that it is at its heart? The results looked very promising: I was able to increase the awaiting rate to 30 MHz. I know these classes should not be used outside of this scope, but the emulator code can be much more elegant if you can just await the next clock cycle and continue your work! I guess I can increase the clock speed even further by reducing the allocations.
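For reference, an “awaits per second” figure like the ones above can be measured with a loop along these lines. This is only a sketch and not necessarily the measurement behind the quoted numbers: `SpinAsync`, `AwaitBenchmark`, the tick count and the use of a plain `Task` are assumptions, and it relies on the `AwaitableSignal` class from the first listing being in the same project.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class AwaitBenchmark
{
    // Hypothetical stand-in for CPU code: awaits the clock once per tick.
    static async Task SpinAsync(AwaitableSignal clock, int ticks)
    {
        for (int i = 0; i < ticks; i++)
            await clock;
    }

    static void Main()
    {
        const int ticks = 1_000_000;
        var clock = new AwaitableSignal();

        // Runs synchronously up to the first await, then returns.
        Task worker = SpinAsync(clock, ticks);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ticks; i++)
            clock.Signal();          // each signal resumes the worker for one step
        sw.Stop();

        Console.WriteLine(worker.IsCompleted);   // True: every await was signalled
        double mhz = ticks / sw.Elapsed.TotalSeconds / 1e6;
        Console.WriteLine($"{mhz:F2} MHz");
    }
}
```

The absolute number depends on the machine, the runtime and the build configuration, so it is only useful for comparing one awaitable implementation against another on the same setup.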

Just in case anyone wants to see the code:

using System;
using System.Runtime.CompilerServices;

[AsyncMethodBuilder(typeof(AsyncStateMethodBuilder<>))]
struct AsyncState<T> : INotifyCompletion
{
     public static AsyncState<T> Completed { get; } = new AsyncState<T>(default(T)); // Can be used to prevent creating of new objects every time...

     private Action fContinuation;
     private T fResult;

     public bool IsCompleted { get; private set; }
     public T GetResult() => fResult;

     public AsyncState(T value)
     {
         fContinuation = null;
         fResult = value;
         IsCompleted = true;
     } 

     void INotifyCompletion.OnCompleted(Action continuation)
     {
         fContinuation += continuation;
     }

     public void SetResult(T value)
     {
         fResult = value;
         IsCompleted = true;
         fContinuation?.Invoke();
     }

     public AsyncState<T> GetAwaiter() => this;
}

struct AsyncStateMethodBuilder<T>
{
     private AsyncState<T> fTask;
     public static AsyncStateMethodBuilder<T> Create() => default;

     public void Start<TStateMachine>(ref TStateMachine stateMachine)
         where TStateMachine : IAsyncStateMachine
     {
         stateMachine.MoveNext();
     }

     public void SetStateMachine(IAsyncStateMachine stateMachine)
         => throw new NotImplementedException();

     public void SetResult(T result) => fTask.SetResult(result);
     public void SetException(Exception exception) => throw exception; // Wow this is dirty...
     public AsyncState<T> Task => fTask;

     public void AwaitOnCompleted<TAwaiter, TStateMachine>(ref TAwaiter awaiter, ref TStateMachine stateMachine)
         where TAwaiter : INotifyCompletion
         where TStateMachine : IAsyncStateMachine
     {
         awaiter.OnCompleted(stateMachine.MoveNext);
     }

     public void AwaitUnsafeOnCompleted<TAwaiter, TStateMachine>(ref TAwaiter awaiter, ref TStateMachine stateMachine)
         where TAwaiter : ICriticalNotifyCompletion
         where TStateMachine : IAsyncStateMachine
     {
         awaiter.OnCompleted(stateMachine.MoveNext);
     }
}

After some tests I realized that the main performance gain comes from the fact that AsyncState is a struct, which means no heap allocations. But this comes with a drawback:

private AsyncState<int> DoSomething() { /*...*/ }

public async void ThisWorks()
{
    await DoSomething();
}

public async void ThisDoesNotWork()
{
    var tsk = DoSomething();
    await tsk;
}

Because AsyncState is a struct, it is passed by value and not by reference, which leads to unexpected behavior if it is stored anywhere: MoveNext ends up bound to the wrong awaiter (a copy), and the program never continues!
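The underlying value-type behavior can be shown without any async machinery. In this minimal sketch, `Flag` is a hypothetical stand-in for `AsyncState<T>`: assigning a struct to a variable creates an independent copy, so state set on one copy is invisible to the other.

```csharp
using System;

// Hypothetical mutable struct, standing in for AsyncState<T>.
struct Flag
{
    public bool IsSet;
    public void Set() => IsSet = true;
}

static class CopyDemo
{
    static void Main()
    {
        var original = new Flag();
        var copy = original;    // structs are copied on assignment

        copy.Set();             // only the copy is modified

        Console.WriteLine(original.IsSet);  // False
        Console.WriteLine(copy.IsSet);      // True
    }
}
```

The same thing happens to the continuation and the completion flag: whichever copy receives them is not necessarily the one that gets completed later.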