Blog ⋅ Peter Nelson

Sunday afternoon hack: Pokémon Blue

by peterdn 23. February 2014 19:36

Inspired by the recent Twitch Plays Pokémon phenomenon, I thought it would be interesting to delve into some old-school ROM hacking with the first-generation Pokémon games. In particular, Pokémon Red and Blue are already relatively well documented by the ROM hacking community and have several infamous bugs, technical analysis of which reveals useful information about the game internals. Exploiting this existing knowledge means we can cheat a little and not get bogged down in the drudgery of analysing a ROM from scratch.

Required Tools

We will require:

  1. Gameboy Color emulator and debugger
  2. Hex editor
  3. Pokémon ROM

On Windows, bgb appears to be one of the best choices of emulator as it features a built-in debugger. This includes disassembly, register, and memory views (shown below). The hex editor I’m using is WinHex. As for the game ROM, it goes without saying that you should own a physical copy of the game before you download a ROM.

For this guide I am using Pokémon Blue; however, I believe it should also work with Red.

The Goal

Our goal will be to change one of the starting Pokémon, Bulbasaur, into something a bit more interesting: Mew (yes, Mew is built into the Generation 1 games). This should prove to be non-trivial but easy enough to do in an afternoon. In addition, choosing your character’s first Pokémon occurs after 30 seconds of gameplay, meaning we can test the hack without needing to play for hours beforehand.


Load up the emulator. Let’s take a snapshot of the game at the point just before we choose a Pokémon so we don’t have to navigate the dialogue every time. For readers unfamiliar with the game, the left-hand screenshot below shows the desired game state. Visible are the 3 Pokémon the player must choose from, sitting on a desk: Charmander, Squirtle, and Bulbasaur (left to right). We are attempting to change the rightmost from Bulbasaur to another Pokémon.

Note that when the player selects a Pokémon, some information about it is displayed in the form of a Pokedex entry, shown in the right-hand screenshot below:


Technical Background


Before we begin, it is necessary to have a rudimentary understanding of the console’s internals. The Gameboy Color’s CPU is an 8-bit modified Zilog Z80 with a 16-bit address bus, meaning it can access 65,536 byte addresses. To overcome this 64KB limitation (the Pokémon ROM, for example, is 1MB), memory bank switching is used. The first 16KB of this address space (0x0000-0x3FFF; at power-on the boot ROM briefly overlays the first 256 bytes) is mapped to the first 16KB of the cartridge ROM. This area is referred to as ROM bank 0. The next 16KB of address space (0x4000-0x7FFF) can be mapped to any other 16KB bank within the ROM.

It is important that we are able to convert between ROM addresses and internal addresses. ROM bank X extends from ROM address X * 0x4000 to (X + 1) * 0x4000 - 1. As mentioned before, any such bank (other than bank 0) is mapped to internal addresses 0x4000-0x7FFF. Therefore, to find the ROM address corresponding to a particular bank and internal address, we use the following conversion formula:

ROM address = (internal address - 0x4000) + bank * 0x4000

And conversely:

Internal address = ROM address % 0x4000 + 0x4000

bank = ROM address / 0x4000
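
These conversions can be written down as a short C++ sketch (my code, not from the game); the constants in the assertions come from the worked examples later in the post:

```cpp
#include <cstdint>

// Convert between linear ROM addresses and (bank, internal address)
// pairs. Valid for the switchable banks (bank >= 1); bank 0 is mapped
// directly at internal addresses 0x0000-0x3FFF.
uint32_t rom_address(uint32_t bank, uint32_t internal) {
    return (internal - 0x4000) + bank * 0x4000;
}

uint32_t internal_address(uint32_t rom) { return rom % 0x4000 + 0x4000; }

uint32_t rom_bank(uint32_t rom) { return rom / 0x4000; }
```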

Pokémon ROM

Internally, Pokémon are uniquely identified by index numbers which are completely distinct from the Pokédex numbers that readers may be familiar with. For example, while Bulbasaur has a Pokédex number of 1, its index number is 0x99. From this article (which is a recommended read in itself), we know that there exists a lookup table at address 0x41024 in the ROM that maps index numbers to corresponding Pokédex entry numbers. The table is indexed from 1, so the byte at ROM address 0x41024 + (0x99 - 1) = 0x410BC has a value of 1 and therefore represents the mapping for Bulbasaur.


We begin by considering how a reasonable implementation of the game might work. Intuitively, we might expect that somewhere in the ROM are 3 bytes, one for each Pokémon on the desk, containing that Pokémon's index number. We would then expect that once a Pokémon is selected by the player, the aforementioned lookup table will be consulted to find the Pokédex number for that index number in order to display the corresponding Pokédex entry (e.g. for Bulbasaur as shown in the screenshot above). If we can detect when this table entry is read, perhaps we can trace back to where the original index number is stored.

Luckily, bgb supports memory access breakpoints. Recall that the byte at ROM address 0x410BC contains the Pokédex number for Bulbasaur. Using our conversion formula we find this should be mapped to internal address 0x410BC % 0x4000 + 0x4000 = 0x50BC. This is where we set our breakpoint, triggered on a memory read:

We then resume our game and select Bulbasaur. After a few seconds, the debugger breaks. In the memory view, we check to confirm that the correct bank (0x410BC / 0x4000 = 0x10) is selected. It is -- success! The bgb debugger at this point is shown below:

Now we take a look at the disassembly view. The debugger is paused on the instruction that triggered the memory access breakpoint (highlighted in blue). This instruction reads the byte at address held in register HL and writes it to register A. We can verify that the HL register contains the address 0x50BC using the register view at the top right.

We can see this looks like a function that begins at address 0x5010:

push  bc
push  hl
ld    a, (D11E)   ; a = index number (0x99 for Bulbasaur)
dec   a           ; a = index number - 1
ld    hl, 5024    ; hl = base address of the lookup table
ld    b, 00
ld    c, a        ; bc = index number - 1
add   hl, bc      ; hl = 0x5024 + (index number - 1)
ld    a, (hl)     ; a = Pokédex number
ld    (D11E), a   ; write it back to 0xD11E
pop   hl
pop   bc
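
In C-like terms, the routine amounts to a one-line table lookup (the function and table names here are mine, purely illustrative):

```cpp
#include <cstdint>

// Sketch of the routine at 0x5010: convert a Pokémon index number to
// its Pokédex number using the lookup table at internal address 0x5024.
// The 'dec a' turns the 1-based index number into a 0-based offset.
uint8_t index_to_pokedex(const uint8_t* table, uint8_t index_number) {
    return table[index_number - 1];
}
```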

In line 3, a byte is read from address 0xD11E; it is then decremented and added to the constant 0x5024---which looks suspiciously like the base address of the index number lookup table---to produce the address 0x50BC. Aha! Let’s follow that clue to address 0xD11E and see what we find there. Indeed we find that the byte at that address contains the value 0x99 (highlighted grey in the memory view):

Address 0xD11E is in an area of internal RAM, therefore it is not present in the ROM and must have been written by the game program. So where does it get written from originally? Let’s add an on-write access breakpoint to find out. After reloading our saved state and selecting Bulbasaur again, we break this time at address 0x5136 (ROM bank 7). The disassembly at that location looks like:

Aha, that instruction at address 0x512F looks exactly like what we want -- the index number appears to be a hardcoded operand! Let’s open up the ROM in our hex editor and modify the corresponding byte. Using our conversion formula, ROM address = (0x512F - 0x4000) + 7 * 0x4000 = 0x1D12F. As the instruction is 2 bytes (opcode followed by operand), the operand is actually at address 0x1D12F + 1. We change it to 0x15 -- the index number of Mew:
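
If you prefer to patch programmatically rather than in a hex editor, the edit can be sketched like this (assuming the ROM has been loaded into a byte buffer; file I/O and checksum fix-ups are omitted):

```cpp
#include <cstdint>
#include <vector>

// The 'ld a, n' instruction at ROM offset 0x1D12F is 2 bytes long
// (opcode then operand), so the starter's index number lives at
// offset 0x1D12F + 1. Mew's index number is 0x15.
void patch_starter(std::vector<uint8_t>& rom, uint8_t new_index) {
    rom[0x1D12F + 1] = new_index;
}
```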

Now we save the file and restart the game. After a few warnings about invalid checksums, we have success!


What next?

Here follows a non-exhaustive list of things to try:

  • Change the other 2 starting Pokémon.
  • Change the Prof. Oak dialogue (notice that he still refers to Bulbasaur, so this text must be located somewhere else).
  • Change the Pokémon starting levels.
  • Give the starting Pokémon more HP/Attack Power/moves (hint: read).
  • Change the Pokémon's graphics.

Happy hacking!



An Introduction to ARM NEON

by peterdn 3. January 2014 21:07

NEON is ARM’s take on a single instruction multiple data (SIMD) engine. As it becomes increasingly ubiquitous in even low-cost mobile devices, it is more worthwhile than ever for developers to take advantage of it where they can. NEON can be used to dramatically speed up certain mathematical operations and is particularly useful in DSP and image processing tasks.

In this post I will show how it can improve a simple image processing algorithm and will compare it to several other approaches. My particular implementation is written for Windows Phone 8; however, the principles should be platform agnostic and therefore easily applicable to Android and iOS as well.

Testbed application: Sepiagram

We will use NEON to speed up an exciting image processing task: applying sepia tone to a photograph. The basic algorithm given in this blog post will serve as a starting point. Output RGB values are a simple linear combination of input RGB values, with red and green given higher weights than blue in order to give the image a yellowish hue:

outputRed = (inputRed * 0.393) + (inputGreen * 0.769) + (inputBlue * 0.189) 
outputGreen = (inputRed * 0.349) + (inputGreen * 0.686) + (inputBlue * 0.168) 
outputBlue = (inputRed * 0.272) + (inputGreen * 0.534) + (inputBlue * 0.131)

The app skeleton is equally simple, consisting of an image view, an image chooser, and several buttons for applying our various implementations of the sepia tone algorithm. I’ve also included some timing code that allows us to very roughly compare the performance of each approach.

Naive C# implementation

Our first implementation in C# is as straightforward as one might expect:

public static void ApplySepia(WriteableBitmap Bitmap)
{
    for (var i = 0; i < Bitmap.Pixels.Length; ++i)
    {
        var pixel = Bitmap.Pixels[i];

        // Extract red, green, blue components.
        var ir = (pixel & 0xFF0000) >> 16;
        var ig = (pixel & 0xFF00) >> 8;
        var ib = pixel & 0xFF;

        // Apply the transformation.
        var or = (uint)(ir * 0.393f + ig * 0.769f + ib * 0.189f);
        var og = (uint)(ir * 0.349f + ig * 0.686f + ib * 0.168f);
        var ob = (uint)(ir * 0.272f + ig * 0.534f + ib * 0.131f);

        // Saturate the result.
        or = or > 255 ? 255 : or;
        og = og > 255 ? 255 : og;
        ob = ob > 255 ? 255 : ob;

        // Write the resulting pixel back to the bitmap.
        Bitmap.Pixels[i] = (int)(0xFF000000 | or << 16 | og << 8 | ob);
    }
}
The only thing that might be immediately non-obvious is how we must saturate the output results. As we’re working with 32-bit integers and the color weights total to more than 1, it’s possible for the output values to be greater than 255. We therefore cap (saturate) them to 255.
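
To see why saturation matters, run a pure-white pixel through the red equation: the red weights sum to 1.351, so the raw result overflows 255 (a quick check of my own, mirroring the formula above):

```cpp
#include <cstdint>

// Unsaturated red-channel output for a given input pixel.
uint32_t raw_sepia_red(uint32_t r, uint32_t g, uint32_t b) {
    return (uint32_t)(r * 0.393f + g * 0.769f + b * 0.189f);
}
```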

On my Lumia 1020 this implementation takes around 600-800ms for a 3072x1728 pixel image. Not bad, but we can do much better.


Naive to native: C++

It turns out that floating point operations are slow[1] and an immediate improvement can be made by modifying the algorithm slightly to process integers instead of floats. We can scale up each color weight by some factor, apply the transformations, then scale the results back down. As our RGB values are integers between 0 and 255 anyway, this completely eliminates any need for floating point arithmetic. We will choose 1024 as our scale factor: it is a convenient power of two (so the final division becomes a bit shift), and it is close enough to 1000 that rounding the 3-decimal-place weights loses almost no precision. Hence, we can rewrite the equations as:

outputRed = ((inputRed * 402) + (inputGreen * 787) + (inputBlue * 194)) / 1024
outputGreen = ((inputRed * 357) + (inputGreen * 702) + (inputBlue * 172)) / 1024
outputBlue = ((inputRed * 279) + (inputGreen * 547) + (inputBlue * 134)) / 1024
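
A quick sanity check (mine, not from the original post) confirms these integer weights are simply the float weights multiplied by 1024 and rounded to the nearest integer:

```cpp
#include <cmath>

// Scale a sepia weight by 1024 and round to nearest, as used to
// derive the integer coefficients above.
long scaled_weight(double w) { return std::lround(w * 1024.0); }
```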

In C++, this becomes:

void NEONRT::NEONRT::IntegerSepia(const Platform::Array<uint32>^ image)
{
    uint32_t *px = (uint32_t *)image->Data;
    uint32_t *end = px + image->Length;

    for (; px < end; ++px) {
        // Extract red, green, blue components.
        unsigned int ir = (*px & 0x00FF0000) >> 16;
        unsigned int ig = (*px & 0x0000FF00) >> 8;
        unsigned int ib = (*px & 0x000000FF);

        // Apply the transformation. ('or' is a reserved alternative
        // token in standard C++, hence the r_out/g_out/b_out names.)
        unsigned int r_out = (ir * 402 + ig * 787 + ib * 194) >> 10;
        unsigned int g_out = (ir * 357 + ig * 702 + ib * 172) >> 10;
        unsigned int b_out = (ir * 279 + ig * 547 + ib * 134) >> 10;

        // Saturate the result.
        r_out = r_out > 255 ? 255 : r_out;
        g_out = g_out > 255 ? 255 : g_out;
        b_out = b_out > 255 ? 255 : b_out;

        // Write the resulting pixel back to the bitmap.
        *px = 0xFF000000 | r_out << 16 | g_out << 8 | b_out;
    }
}
Despite the fact that this version performs 3 more divisions per pixel than the original (implemented as right shifts), it takes around 300-400ms on the same image -- already a 2x speed increase.

Utilizing NEON

To briefly recap: NEON is a single instruction multiple data (SIMD) architecture, meaning it can perform the same arithmetic operation on multiple data values in parallel. It has 32x 64-bit registers, named d0-d31 (which can also be viewed as 16x 128-bit registers, q0-q15)[2]. These registers are treated as vectors of elements of the same data type. NEON arithmetic instructions typically encode the element type in a suffix in order to apply the same operation to all lanes. For example, vadd.f32 considers a 64-bit register as 2x 32-bit floats, whereas vadd.i8 considers it as 8x 8-bit integers. For a more substantive description, please see the official ARM documentation.

Back to our sepia color transformation equations.

On a conventional single instruction single data machine, the sepia algorithm requires a total of 9 multiplications and 6 additions per pixel. Using NEON SIMD instructions, however, we can operate on vectors all in one go. We can rewrite the above formula in terms of vector multiplications and additions:

[outputRed  outputGreen  outputBlue] = inputRed   * [0.393  0.349  0.272]
                                     + inputGreen * [0.769  0.686  0.534]
                                     + inputBlue  * [0.189  0.168  0.131]

This requires only 3 vector multiplications and 2 vector additions per pixel. In fact, NEON includes a vector multiply and accumulate instruction which simultaneously performs a vector multiplication and addition. Using 1 multiply and 2 multiply-accumulates, we can reduce the total number of operations to 3.

We will walk through one iteration of a loop that processes multiple pixels at a time. We have a 32-byte chunk of memory containing 8x 32-bit pixels. These pixels are further subdivided into 4x 8-bit subpixels – alpha, red, green and blue (ARGB). Each block in the figure below represents 1 byte:


NEON allows us to load 32 bytes using a single opcode:

vld1.u32 { d20, d21, d22, d23 }, [r0]

This has the effect of loading 2 pixels into each of the registers d20-d23. Register r0 is a pointer to the 8-pixel block of memory within the bitmap. Now we have:


It should be immediately obvious that we cannot simply multiply in-place by our weights as each subpixel’s value will probably overflow past 255. Therefore we must extend each subpixel to 16 bits:

vmovl.u8 q0, d20
vmovl.u8 q1, d21
vmovl.u8 q2, d22
vmovl.u8 q3, d23

Note the .u8 suffix on these instructions tells NEON to treat the input registers (d20-d23 in this case) as vectors of 8-bit values (i.e. the subpixels). This is an important distinction as our output would be structured differently if, for example, we used the similar vmovl.u16 instruction[3]. Now the two pixels contained in 64-bit-wide d20 are extended and copied to the 128-bit-wide q0, and similarly for d21-d23 and q1-q3. Let’s now consider only the first pixel, which is contained in register d0:


Again, 2 bytes per subpixel is not enough to guarantee we won’t overflow: considering that we are multiplying each subpixel (max 2^8) by its combined weights (> 1) and also by 1024 (2^10), we need at least 18 bits to represent the intermediate values before we again divide by 1024. Luckily, NEON provides vector multiply and multiply-accumulate instructions that automatically widen their outputs. We therefore perform our three vector arithmetic operations, starting with a multiply:

vmull.u16 q4, d16, d0[2]


Here q4 is the destination register, d16 contains our red color weights and d0 contains the pixel we are currently operating on. Note that this form of the vmull instruction takes a 16-bit scalar as its 3rd operand, hence we select the red subpixel by subscripting d0. Next, we multiply by the green weights, contained in d17, and accumulate into the destination register:

vmlal.u16 q4, d17, d0[1]


And then similarly for the blue weights, contained in d18:

vmlal.u16 q4, d18, d0[0]

After repeating for the other 7 pixels (using a unique q register for each; we have 16, after all), we then perform a right shift by 10 to divide by 1024, narrow, and saturate:

vqshrn.u32 d0, q4, #10

Here d0 is the destination register, q4 still contains our first pixel, and 10 is a constant. This is the saturating form of the shift-right instruction, meaning it will cap output values to 2^16 - 1 if they would otherwise overflow their narrower destinations. After repeating another 7 times, we perform the final saturating narrow to fit our pixels back into 4 bytes:

vqmovn.u16 d0, q0


Finally, we set the alpha value to 255 using a bitwise OR and write the results back out to the bitmap:

vorr d0, d0, d19
vst1.32 { d0, d1, d2, d3 }, [r0]

Here d0-d3 contain our 8 pixels, d19 contains the constant 0xFF000000FF000000, and r0 is still a pointer to the 8-pixel block in the bitmap.

What else?

The rest of the assembly routine is mainly concerned with setting up the registers containing the color weights, and looping. A link to the full source code of the app can be found at the end of this post.


The resulting NEON implementation takes roughly 80-100ms – an impressive difference, considering the assembly routine is probably not particularly optimized.

To summarize, the results on my Lumia 1020 for each implementation are as follows:

Implementation        Time/ms (average of 10 runs)
C# (floating point)   697
C++ (integer)         228

Why not NEON intrinsics?

NEON intrinsics are C/C++ macros that compile down to single NEON instructions, at least in theory. In practice I have found that modern compilers still produce downright awful NEON code. My first implementation of the sepia algorithm using NEON intrinsics in C++ performed worse than the naive C# implementation. Looking at the generated assembly, the compiler seems to love pointlessly[4] spilling registers into RAM, which drastically reduces performance. Thinking that the compiler simply wasn’t intelligent enough to unroll each loop, I manually unrolled them, keeping the theoretical effect of the code exactly the same. The algorithm then did run much faster, but also completely incorrectly, producing vertical blue lines in the output. I didn’t have the time or will to work out why, but either it’s a bug in the Visual C++ compiler or I’m breaking some rules somewhere that I don’t yet know about[4].

Final thoughts

  • NEON can dramatically improve performance of algorithms that can take advantage of data parallelism.
  • Compiler support for NEON is still terrible.
  • It’s not much more difficult to write assembly than to struggle with NEON intrinsics. You can beat the compiler.
  • It can be worth converting your algorithm to avoid floating point operations[1].
  • Don’t make any conclusions about performance without thorough benchmarking.

Source code

Download Visual Studio 2013 project.


  1. Not really true in general. Whether floating point operations are slower than integer operations depends on a vast number of factors. From what I’ve experienced on ARM, however, it is often the case.
  2. NEON actually shares its registers with the floating point unit, if one exists.
  3. vmovl.u8 will transform 0xFFFFFFFF to 0x00FF00FF00FF00FF, whereas vmovl.u16 will transform it to 0x0000FFFF0000FFFF.
  4. There may be a good reason I don’t know about, such as certain registers being expected to be preserved across function calls. I’d have thought the compiler would deal with this, however...


ARM | Windows Phone 8

‘Hello World!’ in ARM assembly

by peterdn 14. January 2012 22:29

Over the last few weeks, in an effort to port a small C library to the platform, I’ve been doing a fair bit of tinkering around with the Android NDK.  The NDK is primarily intended to allow Android developers to write performance-critical portions of their apps in native C or C++, which interface with the Android Java API through JNI.  As the C library in question required porting some x86 SIMD assembly, I figured it would be helpful for me to get to know the bare bones of the ARM architecture.  As a means to this end, we can use the NDK’s cross-compiler as a standalone tool to write a simple ‘Hello World!’ console “app” in ARM assembly.  As Android is effectively Linux under the hood, we can apply our x86 Linux assembly programming skills to the ARM platform.

First things first: ‘Hello World!’ in x86 assembly

Briefly, the method involves invoking system calls by talking directly to the underlying Linux kernel.  An example of how to do this in x86 assembly is given here.  As ARM uses a different ABI to x86 (registers are named different things, for a start), this code needs a tiny bit of modification.

Firstly, notice that system calls are identified numerically – in the x86 example, #1 refers to exit() and #4 refers to write().  To invoke a system call, we put its identifier in the EAX register, pass (up to 6) arguments in EBX, ECX, EDX, ESI, EDI, EBP, respectively, and generate interrupt 0x80.  This is described in further detail here.  In contrast, the ARM ‘EABI’ calling convention uses a different method, described vaguely in these patch notes.  We can glean that, on ARM, the system call identifier is put in register R7, arguments are passed in R0-R6 (respecting “EABI arrangement” where appropriate, i.e. for 64-bit arguments), and the kernel is called with the ‘SWI 0’ instruction.

Secondly, as they are not guaranteed to be the same on each platform, we must look up the system call identifiers for exit() and write().  For this we refer to the Linux kernel source – $LINUX_SOURCE_ROOT/arch/arm/include/asm/unistd.h, specifically.  As it turns out, these two system calls do have the same identifiers on both x86 and ARM platforms.
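
The relevant definitions look roughly like the following (paraphrased from the ARM kernel headers; verify against your own source tree):

```cpp
/* From arch/arm/include/asm/unistd.h (abridged). On EABI the
   syscall base is 0; the legacy OABI uses a non-zero base. */
#define __NR_SYSCALL_BASE   0
#define __NR_exit           (__NR_SYSCALL_BASE + 1)
#define __NR_write          (__NR_SYSCALL_BASE + 4)
```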


So our assembly code, in GAS syntax, looks very much like (see inline comments for details):


.data
msg:
    .ascii      "Hello, ARM!\n"
len = . - msg

.text
.globl _start
_start:
    /* syscall write(int fd, const void *buf, size_t count) */
    mov     r0, #1      /* fd -> stdout */
    ldr     r1, =msg    /* buf -> msg */
    ldr     r2, =len    /* count -> len(msg) */
    mov     r7, #4      /* write is syscall #4 */
    swi     0           /* invoke syscall */
    /* syscall exit(int status) */
    mov     r0, #0      /* status -> 0 */
    mov     r7, #1      /* exit is syscall #1 */
    swi     0           /* invoke syscall */


Save the above as hello.S and run it through the GNU cross-assembler provided with the NDK.  I will assume that you have the prebuilt NDK toolchain directory in your PATH (in my case here, /Users/peterdn/android-ndk/toolchains/arm-linux-androideabi-4.4.3/prebuilt/darwin-x86/bin):

arm-linux-androideabi-as -o hello.o hello.S
arm-linux-androideabi-ld -s -o hello hello.o

Deploying to Android

For many, the easiest way to test the above binary is by deploying it to an Android device with an ARM processor.  This also means we can take advantage of the insanely useful ‘adb’ tool.  If you happen to be running bog-standard Linux on an ARM device, the binary should still run, providing your kernel supports the newer EABI (I believe 2.6.15 and above).

To deploy and test on Android, simply run:

adb push hello /data/local/tmp/hello
adb shell /data/local/tmp/hello

It is also possible to run the binary locally on your device using the Android Terminal Emulator, as below:

Android Terminal Emulator screenshot



Android | NDK