How Duke Nukem II’s parallax scrolling worked

Parallax scrolling – creating an illusion of depth in a 2D scene by having the background and foreground move at different speeds – is pretty much a staple of platformers and other 2D games nowadays. Popularized by the arcade game Moon Patrol in 1982, the effect was already quite common in arcades and on home consoles by the early 90s. But PC games were a different story. Among the side-scrolling DOS games released from 1990 to 1993, very few have parallax scrolling (by 1993, it had become a little more frequent, but was still rare). There are also some very early examples, like the 1983 PC port of the aforementioned Moon Patrol, but it features only an outline of a background, not a graphical image. So by having this feature, Duke Nukem II stands out among Apogee’s catalog of platformers, along with its predecessor Duke Nukem from 1991 and the 1992 game Cosmo’s Cosmic Adventure (which shares a lot of code and file formats with Duke 2).

Duke Nukem II’s parallax scrolling in action

It’s worth noting that the vast majority of early Apogee games targeted EGA, to make the games accessible for people with older hardware (VGA is backwards compatible with EGA). Early DOS games featuring parallax scrolling were typically VGA games, and that’s for good reasons, as we’ll see.

Duke Nukem II is kind of a hybrid. It requires a VGA card and features some 256-color scenes, but the gameplay itself uses a 16-color EGA mode, relying on the VGA palette only to achieve custom colors that aren’t possible on EGA. The rendering engine itself is still fundamentally EGA-based, and the game does in fact run on EGA cards – just with incorrect colors.

Before we dive into Duke Nukem’s implementation of the effect, let’s take a look at why parallax scrolling was generally hard to do in EGA games.

What made parallax scrolling hard?

What does parallax scrolling require, on a technical level? Conceptually, it’s quite simple: We need to draw the background separately from the foreground, and use different scroll offsets for each. On consoles and arcades, this was typically achieved via hardware acceleration, where the programmer can set up multiple image layers and the hardware takes care of compositing these together into the final output (unlike the SNES, the NES didn’t have hardware layers, but it still had some hardware features that made parallax scrolling a bit easier to implement). Without hardware support, the compositing has to be done entirely in software, i.e. creating the combined image in the graphics card’s framebuffer. The simplest way of doing that is to simply draw the background first, and then draw everything else on top, overwriting the parts of the background that are obscured.

There are two issues with that when targeting EGA cards. The first problem is that early PCs were not quite fast enough to redraw the entire screen each frame while maintaining full framerate (i.e. native display refresh rate). PCs were primarily designed for office work and business applications, where fast and smooth graphics weren’t a priority (once graphical user interfaces and higher resolutions became popular though, the need for faster graphics increased, leading to the creation of the VESA Local Bus, or VLB). PCs had faster CPUs compared to gaming consoles, but that speed didn’t help much when all the pixels had to be squeezed through the slow ISA bus. Sound cards, hard drive controllers, and other devices at least had DMA support, which meant data could be transferred between system RAM and the hardware device without the CPU’s involvement. But EGA/VGA graphics cards didn’t support that. So the CPU had to “get its hands dirty” and manually copy all the pixels from main memory to video memory. There was also a way to copy from video memory to video memory (latch copy – more on that later), but it still required the CPU to manually drive the process.

How bad was it, exactly? The theoretical maximum bandwidth of the ISA bus is 8 MB/s, which is more than enough to push a 320×200 image with 4 bits per pixel (16 colors) at 70 Hz – that only requires about 2.2 MB/s (32,000 bytes per frame × 70 frames per second). But what does this mean in practice? In order to get some concrete numbers, I wrote a benchmark and ran it on a few DOS-era machines. Here are the results for drawing a single full-screen image to the framebuffer, in what is (to the best of my knowledge) the most efficient way possible:

CPU                | Graphics card                             | Time in ms | Achievable FPS
80486 DX2 @ 66 MHz | Paradise Autoswitch EGA2 (EGA, 8-bit ISA) | 72         | 14
80486 DX2 @ 66 MHz | bit-design PCELC V1.3 (EGA, 8-bit ISA)    | 54         | 19
80286 @ 16 MHz     | Cirrus Logic CL-GD5422 (16-bit ISA)       | 11         | 90
80386 DX @ 40 MHz  | Cirrus Logic CL-GD5420 (16-bit ISA)       | 9          | 111
80486 DX2 @ 66 MHz | Cirrus Logic CL-GD5422 (16-bit ISA)       | 10         | 100
80486 DX2 @ 66 MHz | Tseng Labs ET4000/W32 (VLB)               | 3          | 333

The two EGA cards come first, and their performance is really bad. This is just displaying a single full-screen image, on a really good CPU for the time, and we’re already down to a very low framerate. I wonder how much of this is because these cards only have 8-bit ISA interfaces, cutting the available bandwidth in half, and how much is due to the hardware itself being less efficient. Unfortunately, I don’t have any VGA cards with an 8-bit ISA interface, so I don’t know how they would stack up.

Next, we have two 16-bit ISA VGA cards tested with a range of different CPUs. These cards aren’t doing too badly overall, but we have to keep in mind that we are still only displaying a single full-screen image. A real game also needs to draw sprites and the game world, read inputs, run game logic etc. At 70 FPS, a frame lasts roughly 14 ms. With 11 ms needed just for drawing a single screen, there’s only about 3 ms left for everything else, and that seems practically impossible on these machines. So although it’s a much better situation than on the EGA cards, it’s still not great, and a lot of work was needed to get good performance in games.

What’s also interesting to see here is how little of a difference the CPU makes. In terms of raw power, there’s a vast gulf between the 16 MHz 286 and the 66 MHz 486, which not only has a higher clock speed but also features a cache and other architectural improvements. But the difference in graphics performance is relatively minor. This really illustrates what a bottleneck the ISA bus was. And if that wasn’t clear before, it becomes crystal-clear when we look at the final result, which is using a VLB graphics card: it’s about three times as fast as the ISA graphics cards. This is a much better situation for a game developer, but VLB only arrived in 1992 and was mostly found on higher-end 486 motherboards. So not everybody would’ve owned a system of this performance level, and making your game run well on regular ISA-based cards would make it attractive to a wider audience.

As we can see from these results, doing a naive implementation of parallax scrolling, where we first draw the background and then draw the rest of the game on top of it, would not be possible at a fast, smooth frame rate on pre-VLB graphics cards. It was already a challenge to maintain good rendering speed without parallax scrolling in the mix. Because of these speed constraints, many games opted to only redraw the parts of the screen that had changed since the last frame. But these optimization techniques were at odds with parallax scrolling, since it requires large parts of the screen to constantly change.

Alright, so let’s say we accept a lower frame rate, maybe 35 instead of 70. Or maybe just 20 to 25; it wasn’t uncommon for games at that time to run at these low frame rates, after all. So would that work? Well, this is where the EGA throws us a curve ball and makes things even more difficult.

The EGA’s planar memory layout

Original IBM EGA card. Photo from Wikimedia, created by user Vlask, licensed under CC BY-SA 3.0

EGA uses a color palette of 16 colors. This means that a pixel’s value in the framebuffer doesn’t directly represent a color, but an index into the palette, which then defines the actual color. When scanning out the framebuffer in order to generate a signal for the monitor, the graphics card’s hardware automatically converts these palette indices into color values, on the fly. We only need 4 bits to store a value between 0 and 15, so we can effectively store two pixels in a byte. So intuitively, we would expect the framebuffer to be a linear sequence of bytes, with each byte representing two consecutive pixels, right? Well, that’s not at all how EGA works.

Linear vs. planar memory

Instead, the data is distributed across 4 so-called planes. The first plane stores all the first bits of all pixels, the 2nd plane stores all the 2nd bits of all pixels, etc. When looking at an individual plane, each byte thus represents 8 consecutive pixels, but only one of the 4 bits for each of those pixels.
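
To make the layout more concrete, here’s a minimal C sketch of how 8 pixels map onto the 4 planes. It assumes the usual EGA convention that plane 0 holds the least significant bit of each palette index, and that the most significant bit of a byte is the leftmost pixel. The function names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Split 8 consecutive pixels (palette indices 0..15) into 4 plane bytes.
   Assumption: plane n holds bit n of each index, and bit 7 of a plane
   byte belongs to the leftmost pixel. */
void pixels_to_planes(const uint8_t pixels[8], uint8_t planes[4])
{
    for (int plane = 0; plane < 4; ++plane) {
        uint8_t byte = 0;
        for (int x = 0; x < 8; ++x) {
            assert(pixels[x] < 16);
            uint8_t bit = (pixels[x] >> plane) & 1u;
            byte |= (uint8_t)(bit << (7 - x));
        }
        planes[plane] = byte;
    }
}

/* Recover the palette index of pixel x (0 = leftmost) from 4 plane bytes. */
uint8_t pixel_from_planes(const uint8_t planes[4], int x)
{
    assert(x >= 0 && x < 8);
    uint8_t index = 0;
    for (int plane = 0; plane < 4; ++plane) {
        uint8_t bit = (planes[plane] >> (7 - x)) & 1u;
        index |= (uint8_t)(bit << plane);
    }
    return index;
}
```

Reading a single pixel back requires touching all 4 plane bytes, which already hints at why per-pixel work is expensive with this layout.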

(Why the heck was it done this way? Memory chips at the time were too slow to fetch the pixel data quickly enough to drive the screen at 60 or 70 Hz. By splitting the data into planes, 4 bits could be read in parallel and then combined, and that was fast enough).

What’s more, the CPU can only ever access one of the planes at a time. A 320×200 framebuffer with 4 bits per pixel is 32,000 bytes in total, but the CPU can only see an 8,000-byte window via the memory-mapped video RAM. It has to essentially do bank-switching in order to access the entire data. This bank switching is done by setting a hardware register on the graphics card via port I/O (using the OUT assembly instruction). So in order to copy an image to the framebuffer, we need to:

  1. Choose plane 0 via port I/O
  2. Write all the data for plane 0
  3. Choose next plane via port I/O
  4. Repeat until all 4 planes have been written
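
The steps above can be sketched in C. Since real port I/O can’t be run on a modern system, this sketch uses a small stand-in for the card: FakeEga and its helper functions are hypothetical and only model plane selection. On real hardware, the selection would go through the Sequencer’s Map Mask register (index 2, via ports 0x3C4/0x3C5).

```c
#include <assert.h>
#include <stdint.h>

enum { PLANE_SIZE = 8000, NUM_PLANES = 4 };

/* Hypothetical stand-in for the graphics card: 4 planes plus the
   register that decides which planes a CPU write reaches. */
typedef struct {
    uint8_t planes[NUM_PLANES][PLANE_SIZE];
    uint8_t map_mask;     /* bit n set = writes reach plane n */
    unsigned port_writes; /* counts simulated OUT instructions */
} FakeEga;

/* Simulates the port I/O that selects planes for writing. */
void fake_select_planes(FakeEga *ega, uint8_t mask)
{
    ega->map_mask = mask & 0x0F;
    ++ega->port_writes;
}

/* Simulates a CPU write into the 8,000-byte memory-mapped window. */
void fake_write(FakeEga *ega, unsigned offset, uint8_t value)
{
    assert(offset < PLANE_SIZE);
    for (int p = 0; p < NUM_PLANES; ++p) {
        if (ega->map_mask & (1u << p)) {
            ega->planes[p][offset] = value;
        }
    }
}

/* Copies a full-screen planar image. `image` holds plane 0 first,
   then plane 1, and so on, PLANE_SIZE bytes each. */
void copy_image(FakeEga *ega, const uint8_t *image)
{
    for (int p = 0; p < NUM_PLANES; ++p) {
        fake_select_planes(ega, (uint8_t)(1u << p));
        for (unsigned i = 0; i < PLANE_SIZE; ++i) {
            fake_write(ega, i, image[p * PLANE_SIZE + i]);
        }
    }
}
```

Note that the plane switch happens only 4 times per image here. Drawing many small images means many more switches, which is where the overhead starts to add up.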

As you can imagine, converting image data from linear to planar format wasn’t something you wanted to do on the fly while drawing the images, so games of this era typically stored their image data already in planar layout. Duke Nukem II is no exception there.

But not only does the memory layout dictate our file formats, it also has big implications on how drawing images onto the framebuffer works. So let’s say we’ve already filled the framebuffer with a background, and now we want to draw something on top. The simplest case is drawing an image with a width that’s a multiple of 8, at a position with an x coordinate that’s also a multiple of 8. In that case, everything lines up nicely, and we can simply copy the sprite’s pixels byte for byte, one plane after another. We do have some performance overhead due to switching between the planes, but we only need to copy as many bytes as there are in the source image.

Aligned write to EGA memory

If we want to move this sprite to the right by 3 pixels, things become more difficult. Now we need to target individual bits within a byte, since we’re not addressing the start of an 8-pixel block anymore:

Unaligned write to EGA memory

The good news is that the EGA hardware has some supporting features that make this a little easier. Changing individual bits in a byte requires some bitwise operations: We need to shift the source data so that it’s aligned with the target bits, we need to apply a bitmask so that we only overwrite the right bits, then combine the source data with the target data, and finally write it back. For our example of a 3-pixel offset, this would be something like the following for the first byte of data (in C code):

uint8_t mask = 0xE0; // 0b11100000
*target = (*target & mask) | (*source >> 3);

The EGA can handle all of these bitwise operations for us, taking some burden off of the CPU. But we still need to first read the target data before writing the source data. Reading from video memory causes the EGA to place a copy of the data that was read into internal storage called a latch register. When writing data to the EGA, the written data can be bit-shifted, masked, and combined with the data in the latch register before actually writing it to video memory.

The bad news is that we still have to set up the EGA’s machinery to do the right thing, i.e. we need to figure out the right bitmask and shift values for the position that we want to draw at, and do the port I/O required to configure the hardware accordingly. And, we also need to do a read from video memory to fill up the latch register before each write. Finally, due to being out of alignment with the byte addresses we now need to write more bytes than before, since 8 source image bits (one byte of data) now have to be distributed across two bytes of video memory address space. So for a 16-pixel wide image, we now need to write and read at least 3 bytes instead of just writing 2. And not to forget that everything we’ve just discussed still has to happen 4 times, once for each plane. All of this adds quite a bit of complexity to our code, and negatively impacts performance.
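
To illustrate the extra byte and the masking at both edges, here’s a sketch of an unaligned write for one row of a single plane, done purely on the CPU. This is not how the EGA hardware path looks (there, the shift and mask would be configured via registers), and the function name is made up. It’s only meant to show the bit arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Writes `width_bytes` source bytes (one row of one plane) into `target`,
   moved right by `shift` pixels (0..7). For a non-zero shift this touches
   width_bytes + 1 target bytes and preserves the target pixels to the
   left and right of the image. */
void blit_row_shifted(uint8_t *target, const uint8_t *source,
                      int width_bytes, int shift)
{
    assert(width_bytes > 0 && shift >= 0 && shift < 8);

    if (shift == 0) {
        /* Aligned case: plain byte-for-byte copy */
        for (int i = 0; i < width_bytes; ++i) {
            target[i] = source[i];
        }
        return;
    }

    /* Target bits to keep at the left and right edge, e.g. 0xE0 and
       0x1F for a 3-pixel shift */
    uint8_t keep_left = (uint8_t)(0xFF << (8 - shift));
    uint8_t keep_right = (uint8_t)(0xFF >> shift);

    /* First byte: read-modify-write, keeping the leftmost pixels */
    target[0] = (uint8_t)((target[0] & keep_left) | (source[0] >> shift));

    /* Middle bytes: each one combines parts of two source bytes */
    for (int i = 1; i < width_bytes; ++i) {
        target[i] = (uint8_t)((source[i - 1] << (8 - shift)) |
                              (source[i] >> shift));
    }

    /* Extra byte at the end: read-modify-write, keeping the rightmost pixels */
    target[width_bytes] = (uint8_t)((target[width_bytes] & keep_right) |
                                    (source[width_bytes - 1] << (8 - shift)));
}
```

For a 16-pixel wide image (2 source bytes) and a 3-pixel shift, this writes 3 target bytes, two of which need a read first. All of that then has to be repeated for each of the 4 planes.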

Even if we restrict ourselves to drawing only at multiples of 8, we still run into similar complications as soon as we want to draw images with a width that’s not a multiple of 8, or if we want to draw a portion of the source image starting at a non-multiple-of-8 offset. The latter is exactly what we would need to do in order to make a background image scroll smoothly – in other words, what we need for parallax scrolling.

It’s worth noting that all of these complications only exist on the horizontal axis. Vertically, we are always dealing with byte addresses, so it’s much easier. I could imagine that that’s the reason why Major Stryker, a vertically scrolling shooter from Apogee released in early 1993, features multiple layers of parallax despite targeting EGA.

Earlier, I also said that VGA games more frequently featured parallax, and that VGA was easier to program than EGA. This is primarily because in 256-color VGA mode, each pixel always occupies one byte. So all the complexity caused by the need to address individual bits disappears. VGA still has a planar memory layout, but this applies to bytes instead of bits.

Getting back to Duke Nukem, it seems that the developers looked at the complicated mess that is the EGA, and said “no thanks, I’m not dealing with any of that”. The game is restricted to operate only on an 8×8 pixel grid, which sidesteps most of the complications (there’s one exception: particle effects are drawn as individual pixels and can move freely). This approach was already taken with the first Duke Nukem game, and was then kept for Cosmo’s Cosmic Adventure and later Duke Nukem II. Of course, the question is, how did they achieve parallax scrolling under these constraints? We’ll see in a bit, but first let’s look at how drawing the world works in general.

Drawing the world and sprites

Levels (maps) in Duke Nukem II are built out of tiles, like in most platform games. Tiles can appear in front of or behind sprites representing Duke, enemies and other objects. Some tiles are partially transparent (also called “masked”). They can be placed on top of other tiles (with some restrictions), or just on their own to have the background show through.

The game redraws the entire world every frame, i.e. backdrop, tiles, and sprites. The drawing code is designed around avoiding overdraw (drawing to the same pixel position multiple times) as much as possible, to reduce the amount of video memory writes needed.

Map tiles are drawn first, row by row from the top left of the screen down to the bottom right in a grid of 8×8 pixel blocks. For each grid cell, either a tile is drawn, or a part of the background, or both in case there’s a masked tile. Solid (non-masked) tiles that appear in front of sprites are also already drawn at this point – this may seem strange, but we’ll get to why in a moment. The game’s camera/viewport only scrolls in 8-pixel steps, so the map tiles are always aligned with the grid.

Tile layout of Duke’s sprite

After drawing the tiles and background, it’s time to draw the sprites. The sprite graphics themselves are also arranged into groups of tiles, and are rendered in a similar way as the masked tiles are. So essentially, everything in Duke 2 is based on tiles, even the sprites (as mentioned above, particle effects are an exception to this).

When drawing a sprite, the game goes through its tiles row by row, from top left to bottom right. For each tile, it checks if it’s on screen (a sprite can be partially off-screen), and if the map tile at that location is meant to appear in front of sprites. If it’s not on screen or if the map tile should appear in front, that particular sprite tile is skipped. This approach means that sprites are never overdrawn by solid map tiles, thus minimizing the amount of data that needs to be pushed across the ISA bus to the graphics card. Since sprites can only be placed at locations on the tile grid, they can never appear in-between two map tile locations, so a sprite tile is always either fully visible or fully obscured. Sprites can also be partially obscured by foreground masked tiles, but that’s handled as if the sprite tile is fully visible, accepting a bit of overdraw in that case.
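
The per-tile check can be sketched as follows. This is my own simplified illustration, not the game’s actual code: the function name, the flat `foreground` array, and the idea of returning a count instead of drawing are all assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>

enum { VIEW_W = 32, VIEW_H = 20 }; /* gameplay viewport size in tiles */

/* Walks over a sprite's tiles and decides which ones to draw.
   sprite_x/sprite_y: top-left position in on-screen tile coordinates.
   w/h: sprite size in tiles.
   foreground: VIEW_W * VIEW_H flags, true where a solid map tile
   should appear in front of sprites.
   Returns the number of sprite tiles that would be drawn. */
int draw_sprite_tiles(int sprite_x, int sprite_y, int w, int h,
                      const bool *foreground)
{
    assert(w > 0 && h > 0);
    int drawn = 0;

    for (int ty = 0; ty < h; ++ty) {
        for (int tx = 0; tx < w; ++tx) {
            int sx = sprite_x + tx;
            int sy = sprite_y + ty;

            bool on_screen = sx >= 0 && sx < VIEW_W && sy >= 0 && sy < VIEW_H;
            if (!on_screen) {
                continue; /* sprite is partially off-screen */
            }
            if (foreground[sy * VIEW_W + sx]) {
                continue; /* map tile was already drawn and stays in front */
            }

            /* A real renderer would draw the masked sprite tile here */
            ++drawn;
        }
    }
    return drawn;
}
```

Skipping a tile costs only a couple of comparisons in main memory, whereas drawing and then overdrawing it would cost video memory writes.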

Example of a sprite being partially obscured by foreground map tiles. The left-most column of sprite tiles isn’t drawn, making it look like the sprite is behind the wooden column which was actually drawn before drawing the sprite.

With this system as just described, it’s already possible to implement a simple form of parallax: Keeping the background static/fixed while the map and sprites scroll. This is what the first Duke Nukem game does, but Duke 2 (and Cosmo) go a step further by making the backdrop scroll as well.

Making the backdrop scroll

As mentioned above, the map tiles (and sprites) always scroll in steps of 8 pixels. The backdrop, on the other hand, scrolls in 4-pixel steps. This difference in scrolling speed is what creates the parallax effect. But as we saw in the section on the EGA’s memory layout, drawing images with a 4-pixel offset is complicated. And even if we didn’t have the complications of the EGA hardware, it would still be tricky due to the way the backdrop graphics are organized, but more on that later. Either way, the authors didn’t bother creating dedicated drawing routines just for the backdrops. Instead, they used a little trick: During map loading, the game creates copies of the backdrop image with the pixels shifted up/left by 4.

Unmodified backdrop image
Copy of backdrop shifted left by 4 pixels
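
Creating such a shifted copy is cheap when done once at load time. Here’s a sketch for one row of one plane, under two assumptions of mine: the row is stored as plain consecutive bytes (the game’s backdrops are actually arranged in tiles, which makes the real code more involved), and pixels pushed out on the left wrap around to the right so that the image still repeats seamlessly.

```c
#include <assert.h>
#include <stdint.h>

/* Shifts one row of one plane left by 4 pixels into `dst`.
   Each output byte takes the right half of its source byte and the
   left half of the next one. The last byte wraps around to the first.
   `src` and `dst` must not be the same buffer. */
void shift_row_left_4(const uint8_t *src, uint8_t *dst, int bytes_per_row)
{
    assert(bytes_per_row > 0 && src != dst);

    for (int i = 0; i < bytes_per_row; ++i) {
        uint8_t next = src[(i + 1) % bytes_per_row];
        dst[i] = (uint8_t)((src[i] << 4) | (next >> 4));
    }
}
```

Because a 4-pixel shift is exactly half a byte, no per-pixel work is needed, and the result is again aligned to the 8-pixel grid.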

With these copies at hand, the game can now simply alternate between the regular and shifted versions on each scroll step. Concretely, it uses the shifted versions for odd camera positions, and the regular versions for even ones.

To scroll the backdrop past the initial 4 pixels, the starting column/row within the backdrop source image is changed. Basically, areas of the screen which show the backdrop act as tile-sized windows into the backdrop image, and they can show different parts of the image depending on the scroll position. The area of the screen used for gameplay is 256×160 pixels, which is 32×20 tiles. A backdrop image is 320×200, or 40×25 tiles. Let’s say the camera is at position 160, which is a multiple of 40. Thus rendering starts with the top-left tile of the unshifted version of the backdrop. Next, it switches to the shifted version, still starting at the top-left tile. After that, it goes back to the unshifted version, but now the left-most column of tiles on screen is showing the 2nd column of tiles from the backdrop image, skipping the first tile on each row, etc. The following animation illustrates this:

The top-row of the two versions of the backdrop image is shown above the game’s output. The light blue rectangle indicates the portion of the backdrop image that’s displayed on screen.

Once the end of the backdrop image is reached, the next on-screen tiles after that reset back to showing the tiles from the beginning of the backdrop image, making the backdrop repeat:

The light blue rectangle shows the portion of the backdrop that’s shown in the left half of the image, the yellow rectangle shows the portion that’s used once the image starts repeating.

All of this works exactly the same way in the vertical dimension. For backdrops that can scroll vertically and horizontally, the game creates 4 versions of the backdrop in total: Unmodified, shifted left, shifted up, shifted up and left.
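
Putting the pieces together, the choice of backdrop version and starting tile can be sketched like this. The formulas are my reading of the description above rather than code taken from the game, and the struct, the version numbering, and the function names are made up. Camera positions are assumed to be measured in tiles.

```c
#include <assert.h>

enum { BACKDROP_W = 40, BACKDROP_H = 25 }; /* backdrop size in tiles */

typedef struct {
    int version;   /* 0 = unmodified, 1 = shifted left,
                      2 = shifted up, 3 = shifted up and left */
    int start_col; /* backdrop tile column shown at the left screen edge */
    int start_row; /* backdrop tile row shown at the top screen edge */
} BackdropView;

/* Picks backdrop version and starting tile for a camera position. */
BackdropView backdrop_view(int cam_x, int cam_y)
{
    assert(cam_x >= 0 && cam_y >= 0);

    BackdropView view;
    /* Odd positions use the shifted copies, even ones the regular copies */
    view.version = (cam_x & 1) | ((cam_y & 1) << 1);
    /* Two camera steps (2 x 4 pixels) advance the backdrop by one tile */
    view.start_col = (cam_x / 2) % BACKDROP_W;
    view.start_row = (cam_y / 2) % BACKDROP_H;
    return view;
}

/* Backdrop tile column to show at a given on-screen tile column,
   wrapping around at the right edge of the backdrop image. */
int backdrop_col(const BackdropView *view, int screen_col)
{
    assert(screen_col >= 0);
    return (view->start_col + screen_col) % BACKDROP_W;
}
```

For camera positions 160, 161, and 162, this yields the unshifted image at column 0, the shifted image at column 0, and the unshifted image at column 1, matching the sequence described above.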

The game also features some levels where the backdrop scrolls permanently, independent of the camera position. This works generally the same way as the parallax scrolling, the only difference is that a counter value is used to determine the scroll offset instead of the camera position. The counter is incremented based on time elapsed. For the horizontal version of this auto-scrolling, the backdrop additionally scrolls in 2-pixel steps, not 4-pixel steps. The principle is still the same, but the game creates 4 versions of the backdrop in this case, which are shifted left by 0, 2, 4, and 6 pixels, respectively. These 4 images are then shown in sequence, before incrementing the starting tile column in the source image.
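
The horizontal auto-scroll variant follows the same pattern, just with 4 images per tile column instead of 2. Again, this is a sketch based on the description, with hypothetical names and the assumption that the counter advances by one per 2-pixel scroll step.

```c
#include <assert.h>

enum { AUTO_BACKDROP_W = 40 }; /* backdrop width in tiles */

typedef struct {
    int version;   /* 0..3, shifted left by 0, 2, 4, or 6 pixels */
    int start_col; /* backdrop tile column shown at the left screen edge */
} AutoScrollView;

/* Maps the time-based scroll counter to a backdrop version and column. */
AutoScrollView auto_scroll_view(unsigned counter)
{
    AutoScrollView view;
    /* Cycle through the 4 shifted images first... */
    view.version = (int)(counter % 4);
    /* ...then advance by one tile column and start over */
    view.start_col = (int)((counter / 4) % AUTO_BACKDROP_W);
    assert(view.version >= 0 && view.version < 4);
    return view;
}
```

The finer scroll step costs memory rather than time: twice as many pre-shifted copies, but the drawing code per frame stays the same.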

The techniques we looked at allowed the game to implement a form of parallax scrolling in a fairly efficient way, due to the design which prevents overdraw. But the developers did some additional work to improve performance. Let’s look at that next.

Making tile drawing fast

Tilesets as well as backdrop graphics are arranged in a way that makes drawing individual 8×8 pixel blocks fast. In bitmap image formats, the data is usually arranged by lines of pixels, so that all the pixels of the top-most line are stored first, then the 2nd line etc. But the backdrops and tilesets are instead arranged into tiles. First, 8 lines of 8 pixels each are stored, representing the top-left block of 8×8 pixels in the image. This is followed by 8 lines of 8 pixels representing the 2nd block, etc. Each 8-pixel line consists of 4 bytes, storing the 4 EGA bit planes for those 8 pixels. This arrangement makes it possible to seek to a desired tile position within the image, and then read 32 consecutive bytes to obtain all 8×8 pixels making up that tile.
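
In code, locating a tile in this format is a single multiplication. The following sketch also shows how a pixel could be read back from a tile. The order of the plane bytes within each line (plane 0 first) is an assumption on my part, and the names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TILE_LINES = 8, TILE_PLANES = 4, TILE_BYTES = TILE_LINES * TILE_PLANES };

/* Byte offset of the tile at (col, row) in a tile-arranged image. */
size_t tile_offset(int col, int row, int tiles_per_row)
{
    assert(col >= 0 && row >= 0 && col < tiles_per_row);
    return ((size_t)row * (size_t)tiles_per_row + (size_t)col) * TILE_BYTES;
}

/* Palette index of pixel (x, y) within one 32-byte tile.
   Assumption: each line stores its 4 plane bytes in order 0, 1, 2, 3. */
uint8_t tile_pixel(const uint8_t *tile, int x, int y)
{
    assert(x >= 0 && x < 8 && y >= 0 && y < TILE_LINES);

    const uint8_t *line = tile + y * TILE_PLANES;
    uint8_t index = 0;
    for (int p = 0; p < TILE_PLANES; ++p) {
        uint8_t bit = (line[p] >> (7 - x)) & 1u;
        index |= (uint8_t)(bit << p);
    }
    return index;
}
```

For a 40×25 tile backdrop, the last tile starts at offset 31,968, so the whole image takes up exactly 32,000 bytes, the same as a regular full-screen image.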

When the game loads a level, the tileset and background are uploaded into video RAM. This makes it possible to use a technique called latch copy. With this technique, the game can copy a tile from the backdrop or tileset by reading and writing just 8 bytes. Compared to the 32 bytes that would be necessary to copy a tile from main memory to video memory, that’s much quicker, and we don’t need to perform the port I/O to switch between planes either, which saves additional time.

We already came across the EGA’s latch register earlier, when discussing drawing images. What I didn’t mention is that there are actually 4 latch registers, one for every plane. Every time the CPU reads a byte from video RAM, it only gets the data for the currently selected plane, but internally, the graphics card loads the data for all 4 planes into the latch registers. And it turns out that the hardware can be configured to ignore the data from the CPU during writes, and instead only use the values from the latches. This makes it possible for the CPU to read a single byte, which fills up the latches with the data for all 4 planes, and then write it back at a different address, which will store the data from the latches into all 4 planes at that target address. What value the CPU writes doesn’t matter, only the address needs to be correct. Thanks to this mechanism, we can copy 4 bytes for the cost of one byte.
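
Here’s a small simulation of that mechanism. LatchEga is a hypothetical model that only covers the latches, not a full emulation, and the assumption that a tile’s 8 lines sit at consecutive video memory addresses is mine. On real hardware, the latch-only behavior corresponds to write mode 1, selected through the Graphics Controller’s Mode register (index 5, via ports 0x3CE/0x3CF).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { LATCH_PLANES = 4, LATCH_PLANE_SIZE = 16384 };

/* Hypothetical model of the card's planes and latch registers. */
typedef struct {
    uint8_t planes[LATCH_PLANES][LATCH_PLANE_SIZE];
    uint8_t latches[LATCH_PLANES];
    uint8_t selected_plane; /* simplified: plane used for CPU reads/writes */
    bool latch_write_mode;  /* true = writes store the latches, not CPU data */
    unsigned cpu_reads;
    unsigned cpu_writes;
} LatchEga;

/* A CPU read returns one plane's byte, but fills all 4 latches. */
uint8_t latch_read(LatchEga *ega, unsigned offset)
{
    assert(offset < LATCH_PLANE_SIZE && ega->selected_plane < LATCH_PLANES);
    for (int p = 0; p < LATCH_PLANES; ++p) {
        ega->latches[p] = ega->planes[p][offset];
    }
    ++ega->cpu_reads;
    return ega->latches[ega->selected_plane];
}

/* A CPU write either stores the CPU's value into the selected plane,
   or (in latch mode) stores all 4 latches and ignores the value. */
void latch_write(LatchEga *ega, unsigned offset, uint8_t cpu_value)
{
    assert(offset < LATCH_PLANE_SIZE && ega->selected_plane < LATCH_PLANES);
    ++ega->cpu_writes;
    if (ega->latch_write_mode) {
        for (int p = 0; p < LATCH_PLANES; ++p) {
            ega->planes[p][offset] = ega->latches[p];
        }
    } else {
        ega->planes[ega->selected_plane][offset] = cpu_value;
    }
}

/* Copies one 8x8 tile within video memory: 8 reads and 8 writes
   move 32 bytes. dst_stride is the number of bytes per screen line. */
void copy_tile_latched(LatchEga *ega, unsigned src, unsigned dst,
                       unsigned dst_stride)
{
    assert(ega->latch_write_mode);
    for (unsigned line = 0; line < 8; ++line) {
        (void)latch_read(ega, src + line);
        latch_write(ega, dst + line * dst_stride, 0); /* value is ignored */
    }
}
```

With a 320-pixel wide screen, `dst_stride` would be 40. The CPU never sees most of the data it moves, which is exactly why this is so much cheaper than going through main memory.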

For drawing sprites and masked tiles, the game is forced to copy all 4 planes separately from main memory, since it needs to do some bitwise operations in order to apply the transparency mask. But the tile and backdrop drawing makes up the bulk of the data written each frame, so by using this optimization, the game gets a significant speed boost. How much of a boost? I’ve also benchmarked this by drawing a screen full of tiles, both with and without using the latch copy technique. Here are the results:

CPU                | Graphics card                             | Latch copy      | Regular copy    | Speedup
80486 DX2 @ 66 MHz | Paradise Autoswitch EGA2 (EGA, 8-bit ISA) | 36 ms / 28 FPS  | 86 ms / 12 FPS  | 58 %
80486 DX2 @ 66 MHz | bit-design PCELC V1.3 (EGA, 8-bit ISA)    | 34 ms / 29 FPS  | 87 ms / 11 FPS  | 61 %
80286 @ 16 MHz     | Cirrus Logic CL-GD5422 (16-bit ISA)       | 20 ms / 50 FPS  | 50 ms / 20 FPS  | 60 %
80386 DX @ 40 MHz  | Cirrus Logic CL-GD5420 (16-bit ISA)       | 13 ms / 77 FPS  | 32 ms / 31 FPS  | 59 %
80486 DX2 @ 66 MHz | Cirrus Logic CL-GD5422 (16-bit ISA)       | 12 ms / 83 FPS  | 28 ms / 36 FPS  | 57 %
80486 DX2 @ 66 MHz | Tseng Labs ET4000/W32 (VLB)               | 5 ms / 200 FPS  | 10 ms / 100 FPS | 50 %

Aside from the VLB card, which sees a slightly smaller gain, the latch copy optimization cuts the drawing time by about 60 % in all cases (the speedup is the time saved relative to the regular copy, e.g. going from 86 ms to 36 ms saves 58 %). What’s interesting is that the EGA cards can actually draw faster when using this technique compared to just drawing the entire screen at once, whereas the latter is quicker on all the other cards. Also, we can now see a bigger impact from the CPU speed, with a clear improvement of 7 ms when going from the 286 to the 386. This makes sense, since the speed at which the CPU can trigger latch copies is now more relevant compared to raw bus bandwidth for the entire-screen-at-once case.

Wrapping up

Duke Nukem, in its first incarnation, was heavily designed around the EGA’s limitations: Fixing everything to a grid of 8×8 pixel blocks avoided the complications caused by unaligned EGA memory writes. Drawing the entire world as a grid of tiles, including the background and sprites, kept overdraw to a minimum, reducing the amount of bandwidth needed. It made it possible to achieve simple parallax scrolling in 1991, when very few DOS games did it. But there was a cost: Because everything moved in chunky 8-pixel steps, the update rate had to be kept fairly low, as the game would move way too fast otherwise. As we can see from the benchmarks, even a 286 is still capable of rendering a screen full of tiles at 50 FPS when the latch copy optimization is used. So it seems plausible that Duke Nukem could have achieved a decent 35 FPS. But this would’ve made the game too fast to play, so the framerate had to be reduced. Cosmo’s Cosmic Adventure and Duke Nukem II evolved the engine further, adding scrolling backgrounds with the help of some trickery. But the fundamental design necessitating the slow framerate stayed the same.

So was it worth it? Well, Duke Nukem and Cosmo were very successful, staying in the top 10 of the Shareware sales charts long after their release. I haven’t found detailed data for it yet, but I believe Duke Nukem II did very well too. At the time, the games were definitely appealing, both in terms of their polished and fast gameplay as well as their graphics. And it’s worth noting that the choppy framerate is less severe on a CRT monitor. But how well do they hold up? Compared to other games like Commander Keen which prioritized smooth scrolling over parallax, the low framerate and chunky scrolling in Duke Nukem/Cosmo can be uncomfortable. I personally still loved playing Duke Nukem II even many years later, but I know people who could never get into the game because the choppy presentation is too hard on their eyes.

Fortunately, RigelEngine isn’t burdened by the complications of the old EGA hardware anymore. On a modern computer with a dedicated GPU capable of rendering complex 3D graphics, drawing a few layers of 2D images on top of each other is trivially easy, and very fast. Even very low-end systems like a first-generation Raspberry Pi can easily handle parallax scrolling. This allows RigelEngine to (optionally) enhance the original experience with its smooth scrolling and movement mode, which makes the game display at 60 FPS (or higher) without altering the gameplay speed. The parallax effect then takes full advantage of the additional frames available, moving in more frequent 1-pixel steps instead of 4-pixel steps that are further apart in time. This feels much smoother, and is perhaps what the original developers might have done if the technical possibilities had been available to them. Maybe they would even have added additional parallax layers if they could have.

Here’s a video showing the difference:

This wraps up our look at the parallax scrolling in Duke Nukem II, and the difficulties that made it hard for early DOS games to implement this effect. Game development is as challenging today as it was back then, but the challenges are very different nowadays, and some things that used to be hard – like parallax scrolling – have become commonplace. Digging into the challenges faced by game devs in the 90s, like EGA programming, can be fun, but it’s probably a good thing that putting images onto a screen has become a lot less idiosyncratic.
