Intel 8086 String Instructions

The Intel 8086 microprocessor, introduced in 1978, was Intel’s first 16-bit microprocessor. It extended the architecture of the 8-bit 8080 with a segmented address scheme allowing up to one megabyte of memory to be addressed, full 16-bit arithmetic, multiply and divide instructions, and support for coprocessors such as the 8087, which provided hardware floating point arithmetic. Another innovation of the 8086 was what were called “string instructions”, with which blocks of memory as large as 64 kilobytes could be moved, cleared, or compared with a single instruction that runs as fast as the memory can cycle, many times faster than a written-out loop in machine language. A six-instruction memory byte move loop:

next        mov     al,[si]     ; Load AL from [src]
            mov     [di],al     ; Store AL to [dst]
            inc     si          ; Increment src
            inc     di          ; Increment dst
            dec     cx          ; Decrement len
            jnz     next        ; Repeat the loop

can be replaced with:

            cld                  ; Clear the Goddam decrement bit
            rep                  ; Repeat until CX = 0
            movsb                ; Move the data block

where the “rep” instruction modifies the subsequent “movsb” instruction (move bytes) to repeat, counting down the CX register for each byte until it is zero. The written-out loop does 11 memory accesses for each byte it moves, nine to fetch the instructions and two to actually load and store the byte, while the “rep movsb” does only the load and store for each byte.
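The semantics of the repeated move are simple enough to sketch in a few lines of Python (a model, not the silicon: a flat byte list stands in for the 64 kilobyte segment, and `df` models the direction flag that `cld` and `std` clear and set):

```python
def rep_movsb(mem, si, di, cx, df=0):
    """Model of REP MOVSB: copy CX bytes from [SI] to [DI].

    mem is a mutable byte list standing in for the segment; df is the
    direction flag (0 = addresses ascend after CLD, 1 = descend after
    STD). Returns the final SI, DI, and CX, as the real instruction
    leaves them in the registers.
    """
    step = -1 if df else 1
    while cx != 0:
        mem[di] = mem[si]   # the load and store: the only memory
        si += step          # traffic per byte, unlike the nine
        di += step          # instruction-fetch bytes of the
        cx -= 1             # written-out loop
    return si, di, cx
```

Note that, as on the real chip, the pointer and count registers are left updated when the copy finishes; keeping them current at every step is what makes interrupting and resuming the instruction possible.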

This is a dramatic improvement in efficiency, but poses a problem. Since a repeated instruction can perform up to 65,535 operations, given the speed of memory at the time, a string instruction could take a substantial fraction of a second to execute, longer if the memory was slow and imposed wait states on the processor. If the CPU locked up until the repeat instruction was done, it could not process interrupts from I/O devices, leading to lost data from, for example, user keystrokes or characters arriving from serial ports. So, a repeated instruction must be able to be interrupted, then resumed when interrupt processing is complete. But recall that, for example, a move instruction can operate on either 16-bit words or individual 8-bit bytes, can run with addresses either ascending or descending (the dreaded and detested “direction flag”), and 16-bit words may be misaligned with respect to physical memory, which forces two separate byte accesses for each word (programmers try to avoid this because of the performance hit, but if you do it, it still has to work). This makes the process of adjusting everything when the instruction is interrupted exquisitely complicated, with the counter and pointer registers needing to be adjusted by anywhere from −3 to +2 at each interrupt, which is accomplished by a remarkable piece of silicon magic.
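The resumability requirement can be illustrated with a deliberately simplified Python model (a byte list for memory, a callable standing in for the INTR line; the real microcode must also cope with unaligned words, the direction flag, and the −3 to +2 register fix-ups just described, all of which this sketch sidesteps by checking for interrupts only on word boundaries):

```python
def rep_movsw_interruptible(mem, si, di, cx, pending_interrupt):
    """Model of an interruptible REP MOVSW.

    One 16-bit word (two bytes) is moved per iteration; between
    iterations the CPU checks for interrupts. On an interrupt it simply
    stops, leaving SI, DI, and CX describing the remaining work, and
    backs IP up to the REP prefix so the whole instruction restarts
    where it left off when the handler returns.
    """
    while cx != 0:
        if pending_interrupt():
            return si, di, cx, "interrupted"  # IP backed up to REP
        mem[di], mem[di + 1] = mem[si], mem[si + 1]  # move one word
        si += 2
        di += 2
        cx -= 1
    return si, di, cx, "done"
```

Because the registers always describe the work remaining, “resuming” is nothing more than executing the instruction again; the hard part on the real silicon is making that invariant hold even when an interrupt arrives between the two bus cycles of an unaligned word.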

All of this and more is explained in Ken Shirriff’s latest installment of his magnificent series reverse-engineering the 8086 at the transistor level, “The microcode and hardware in the 8086 processor that perform string operations”. The explanation of the clever way this is implemented in the processor’s microcode is particularly enlightening.

All x86 architecture processors, 45 years later, still support these instructions and must cope with the complexity they create (compounded by extensions that move data 32 and 64 bits at a time).

See also the 2022-11-29 post “Fixing a Bug in the Intel 8086 Processor Silicon”.


In the late ’70s, I got my hands on the datasheet for the soon-to-be-released 8086. That’s when Steve Freyder and I at the PLATO lab at the University of Illinois began work on an OS even before there were any physical chips. I asked Steve, who was my housemate at the time, to write an emulator for the 8086 using COMPASS macros (previously developed by Don Lee*) for the CDC Cyber 6500 on which PLATO was running. The reason I was so aggressive about getting to market first with an OS for the 8086 is that I knew the first OS for the 8086 to achieve widespread adoption would create a network effect between software developers and software consumers, the quality of which would be constrained by the quality of the OS, and which, if commercialized, would create a software monopoly.

So, we based our design on the DEC PDP RSX OS for real time functionality. Of course, the emulated interrupt structure wouldn’t have handled the nuances of string move instructions locking out interrupts. In any event, that project was disrupted by CDC offering me an opportunity to pursue development of a mass marketable version of the PLATO network, which would have bypassed the personal computer era entirely with network-based computers, but that’s another, tragic, story I’ve related elsewhere.

It probably bears mentioning that Cray had some conflicts with the NSA over their demand for string processing functions – obviously for cryptography – which added what came to be called the “exponential kludge” in the Cyber argot I was familiar with. One instruction the spooks wanted turned out to be really useful outside of string processing: population count (the number of one bits in a 60-bit word).
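Population count is easy to model; this Python sketch (the name `popcount60` is mine, not a Cyber mnemonic) counts the one bits of a 60-bit word using the classic clear-the-lowest-set-bit trick, which loops once per one bit rather than once per bit position:

```python
def popcount60(word):
    """Population count of a 60-bit word: the number of one bits."""
    word &= (1 << 60) - 1   # confine to 60 bits, as on the Cyber
    count = 0
    while word:
        word &= word - 1    # clear the lowest set bit
        count += 1
    return count
```

Population count gives the Hamming weight of a word, and XOR followed by population count gives the Hamming distance between two words, which is one reason cryptanalysts found it so useful.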

*Don Lee produced the first ray-traced image (which appeared in Ted Nelson’s “Computer Dreams”) on a bet with Ron Resch, who later went to the University of Utah and worked with Evans and Sutherland on more sophisticated ray-tracing rendering technology. Interestingly, decades later, I had to call this bit of history to Jim Bliss’s attention, as he was unaware of it.


The 8086 string instructions do not lock out interrupts. The repeated instruction can be interrupted at any time, with the instruction pointer backed up so the instruction will resume when the interrupt handler returns. The complexity comes from operations supporting both byte and word (16-bit) operands, with addresses either ascending or descending, and the possibility that a word operation is performed on an unaligned address, forcing the bus interface unit to make two separate accesses on the 16-bit-wide memory bus to load or store the two bytes of the unaligned word.

Most 16-bit architectures required word operations to be aligned and generated an error trap if a word address was odd. Intel chose not to impose that requirement at the cost of substantial complexity in the memory bus interface.
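The alignment cost reduces to a simple rule, stated here as a trivial Python model (my formulation, assuming the 8086’s 16-bit bus; the 8088’s 8-bit bus always needs two cycles per word regardless of alignment):

```python
def word_bus_cycles(addr, bus_width_bits=16):
    """Bus cycles for a 16-bit word access on the 8086 family.

    On the 16-bit bus, an even (aligned) address transfers both bytes
    in one cycle; an odd address forces the bus interface unit to make
    two one-byte accesses. An 8-bit bus (8088) always takes two.
    """
    if bus_width_bits == 8:
        return 2
    return 1 if addr % 2 == 0 else 2
```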


Thanks for that correction of my misreading, although I should have recalled it from my own multiuser chat system on my own quasi-OS, written in MASM on a 7 MHz 8088 PC circa 1984. In that system I avoided serial port interrupts and supported 24 users running at 2400 bps at full speed with instantaneous echo, with straight-line code polling all ports at maximum character throughput. Worked like a charm.

It even made money in San Diego (the FORA service) until the military (nosc) gave my competitor Usenet feeds and refused to give me the same. You’ll notice “nosc” is one of the Usenet feeds I was using in the late 80s Usenet space.policy wars after I capitulated and had to use my competitor’s system.