Integration – [Processor, Memory] Vs. [Visiting Places, People]

Today I want to try out explaining the integration between processor and memory using visiting places and people as the reference point.

Have you ever observed the entry and exit queues at visiting places like a large zoo or a famous temple?  Can you reason out why they are designed the way they are?

Zoo – Multiple entry gates and multiple exit gates

Temple – One entry gate and one exit gate

Why doesn’t a temple have multiple gates?  Why doesn’t a zoo have only one gate?

Zoo – After the entry gate, the possible ‘views’ are many.  One visitor can go to the animals, another to the birds, yet another to the trees.  The more views available, the better the crowd spreads across them.  So, admitting more people at a stretch does not hurt; it actually makes the system better.  Fewer entry gates only increase queue lengths and leave the zoo underutilized.

Temple – The only ‘view’ is the holy deity.  There is just one ‘view’.  So, admitting more people at once makes the situation worse.  You know how good humans are at self-discipline :-) (of course, there are exceptions).

What does that observation tell us?  The in-flow and out-flow must be designed with the actual ‘view’ or ‘consumption’ system in mind.  A superior in-flow (many entry gates) designed without thinking of the main consumer (a temple) creates a mess.  An inferior in-flow (few entry gates) when the main system (a zoo) is ready for heavy consumption reduces usage efficiency.

When it comes to computers, processor and memory are designed the same way. 

Processors are designed around a ‘word’ pattern rather than a ‘byte’ pattern.  For example, you hear of 16-bit processors, 32-bit processors, 64-bit processors.  Those widths (16-bit, 32-bit, 64-bit) are word sizes.  Processors process a word at a time.  The registers, arithmetic logic unit, accumulator, etc. are all in sync with the ‘word’ pattern.

Let us come to the memory and see a bit more into it. 

Byte-addressable memory is a memory technology where every byte can be read/written individually, without touching other bytes.  This technology is better for software, as multiple types can be supported with ease.  For example: extra small int (1 byte), small int (2 bytes), int (4 bytes), long int (8 bytes), and extra long int (16 bytes) can all be supported with just ‘length of the type in memory’ as the design point.  There are no alignment issues, like small int must be on a 2-byte address boundary, int must be on a 4-byte address boundary, and so on and so forth.  Surely, from the software point of view, byte-addressable memory is the right technology.  But this memory is a bad choice for processor integration.

Word-addressable memory is a memory technology where one can read/write only a word at a time.  This is better for processor integration, as processors are designed for ‘word’ consumption.  But it surfaces memory alignment issues to software, which has to deal with them at those layers.  It also brings challenges like the endianness problem across processors with different ‘word’ patterns.

From the processor-memory integration point of view, ‘word addressing’ wins.  From the software-memory integration point of view, ‘byte addressing’ wins.

Hardware is manufactured in factories (and is hard to change after fabrication).  Whereas software is far more tunable/changeable/adaptable – change one line of code and recompile, and the change is ready in hand (deployment is a separate issue though).  So the choice stares us in the face: choose the memory that is right for processors, and let the problems be solved at upper layers like software and compilers.

So, compilers came up with techniques like padding.  Compilers also support packing, to let developers make their own choice and override the compiler’s inherent padding behavior.

With all that understanding, let us take a simple primitive as an example and reason through these design choices.

Memory Copy:  Copy byte values from one memory location to another memory location

Signature: memcpy(source, sourceOffset, target, targetOffset, count)

It is very common for a program to require copying bytes from one location to another (the network stack is a famous example).  In simplistic code, the memory copy primitive would look like this (data types, bounds checking, etc. are excluded for brevity):

for (int offset = 0; offset < count; offset++)
    target[targetOffset + offset] = source[sourceOffset + offset];

To a software programmer who does not know the underlying design details, this looks like correct and performant code.  Well, software engineers are smart :-) and love to learn.  We know that SDRAM is the memory technology, and the hardware is ‘word’ based.  That means even if I were to read the byte at address ‘x’, the underlying hardware is going to fetch a whole ‘word’ into the processor.  The processor then extracts the required byte (typically using the ALU and registers) from that word and hands the byte to the software program.

What does this mean to the above code?

Assume the source and target offsets are aligned on a word boundary, and let us say the word is 64-bit (8 bytes).

When the for loop has offset = 0, the source memory bytes from sourceOffset to sourceOffset + 7 are read (that is, one word).  Because the software asks for only the first byte, that byte is extracted and the other bytes are thrown away.  Again, when offset = 1, the same word is read again from RAM, but a different byte (the second byte) is extracted and given to the software.  So on and so forth, until offset = 7.

So, for offset = 0 through offset = 7, the code is inherently reading the same word from RAM 8 times.  Why not fetch it only once and use it in a single shot?  Well, that is what a real memcpy implementation does (a learned programmer’s code).  Here is a modified version:

// Copy as many ‘words’ as possible (the word is 64-bit, so long int – 8 bytes)

long int offset;
for (offset = 0; offset + 8 <= count; offset += 8)
    *(long int *) (target + targetOffset + offset) = *(long int *) (source + sourceOffset + offset);

// Copy remaining bytes that do not complete a ‘word’

for (/* continue offset value */; offset < count; offset++)
    target[targetOffset + offset] = source[sourceOffset + offset];


Well, in reality, the memcpy code is not as simple as the above, because the target and source offsets might be such that they are not word aligned.  If I am not wrong, memcpy could actually contain assembly code directly (and some implementations do have assembly code).  After all, it is all about mov (to move a word) and add (to increment the offset) instructions (I remember my 8086 assembly programming lab sessions!).
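One common fix-up for the misaligned case can be sketched in plain C (this is my illustration, not the real memcpy): copy bytes until the destination reaches a word boundary, then copy whole words, then finish the tail.  Note this only fully helps when source and destination are misaligned by the same amount; real implementations handle the mixed case too, often in assembly.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch of an alignment-aware copy; 'word_copy' is a hypothetical name. */
static void word_copy(uint8_t *dst, const uint8_t *src, size_t count) {
    /* Head: byte-copy until dst is 8-byte aligned (or we run out). */
    while (count > 0 && ((uintptr_t)dst & 7) != 0) {
        *dst++ = *src++;
        count--;
    }
    /* Body: copy full 64-bit words.  Going through a temporary with
       memcpy keeps the word access legal even if src is misaligned. */
    while (count >= 8) {
        uint64_t w;
        memcpy(&w, src, 8);
        memcpy(dst, &w, 8);
        dst += 8; src += 8; count -= 8;
    }
    /* Tail: remaining bytes that do not complete a word. */
    while (count-- > 0)
        *dst++ = *src++;
}
```

Usage is the same as any byte copy; the caller never sees the head/body/tail split:

```c
uint8_t src[32], dst[32];
for (int i = 0; i < 32; i++) src[i] = (uint8_t)i;
word_copy(dst + 1, src + 3, 20);  /* deliberately misaligned */
```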

Padding and packing are also super-important when one is worried about performance.  Padding keeps content/data/variables aligned.  Otherwise, efficient code like the above is not useful at all, and performance suffers.

That is all for now, thanks for reading.  If you like it, let me know through your comments on the blog.  Encouragement is the secret of my energy :-).


Laxmi Narsimha Rao Oruganti (alias: OLNRao, alias: LaxmiNRO)