Wednesday, 21 November 2018

Processing the Pipeline

So now we return to our beloved CPUs and delve deeper into how they operate. We have already had a brief overview, but now we return to look at their innermost workings.

So, first, remember the instruction cycle: fetch, decode, execute, store? Well, as mentioned, each part of the cycle takes one clock tick, so for a computer to go through the entire cycle requires four clock cycles. That means the time it takes to complete one instruction is four cycles, and only once that is complete can it move on to the next instruction. It is a bit like the process of making a burger: you take the order, you cook the beef, you assemble the burger, and then you hand it to the customer. Imagine if the next customer had to wait until the first customer had got their burger before they could even be served. It would be pretty annoying, wouldn't it? This is why a process called pipelining was introduced.

Pipelining is basically the assembly line, or the modern fast food restaurant. In our example above, once the first order is taken and the beef is being cooked, the cashier takes the next order. Once the beef has been cooked, the beef cooker receives the next order and cooks the beef for that one, while the first order is having all the goodies added to it. Now, each individual order still takes the same amount of time; it is just that you are now able to make more burgers. Basically, the throughput increases, so instead of taking 16 cycles to make 4 burgers, you are able to make 4 burgers in 7 cycles.
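The burger arithmetic above can be sketched in a few lines of Python; the stage and instruction counts are just the ones from the example, and the model assumes every instruction spends exactly one cycle in each stage:

```python
# Hypothetical sketch comparing sequential vs pipelined execution.
# Assumes every instruction takes exactly one cycle per stage.

def sequential_cycles(stages: int, instructions: int) -> int:
    """Each instruction must finish all stages before the next begins."""
    return stages * instructions

def pipelined_cycles(stages: int, instructions: int) -> int:
    """Once the pipeline is full, one instruction completes every cycle."""
    return stages + instructions - 1

# Four burgers through a four-stage "kitchen":
print(sequential_cycles(4, 4))  # 16 cycles
print(pipelined_cycles(4, 4))   # 7 cycles
```

The pipelined count is stages + instructions − 1: it takes `stages` cycles for the first instruction to drain through, after which one finishes every cycle.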

The same works for processing instructions inside a CPU, as the diagram below shows:
Pipeline, 4 stage.svg
By en:User:Cburnett, own work (created with Inkscape), CC BY-SA 3.0.

So, as you can see, once one task has completed a stage, the next task can begin that stage, so while the speed of an individual task isn't necessarily any faster (you can only change that by increasing the clock speed), the number of tasks that can be completed over a period of time (the throughput) increases. In a way, it is like adding three extra people to the task of making the burger. Oh, and this is actually pretty simplified, since modern CPUs have broken these tasks down even further, so you will find CPUs with 14 or more stages in their pipelines.

However, that doesn't necessarily mean that everything is fine and dandy. Take, for instance, using our burger analogy, that the stages are timed based on a simple cheeseburger, but then an order comes through for a grand deluxe burger. All of a sudden one of the stages takes longer to complete than it would for the simple cheeseburger, which means that the next burger along the line has to wait until the grand deluxe burger has cleared that stage before work can continue. The same is the case with CPUs: the process stalls, because the next instruction cannot proceed until the current one has been completed. The terminology for this is a 'pipeline stall'. Further, when the complex instruction finally completes and moves on, it leaves behind a point where no work can be done because the next instruction has been delayed, which is referred to as a pipeline bubble.

Once again, we have a diagram to help us understand what is happening here:

Pipeline, 4 stage with bubble.svg
By en:User:Cburnett, own work (created with Inkscape), CC BY-SA 3.0.

So, in the above example, one stage of the second instruction has taken longer, which has created a gap, or a bubble, between the first and the second instruction where part of the processor sits idle.
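In this simple model, a stall just adds its delay onto the ideal pipelined total, since every instruction behind the slow one is held up by the same amount. A hypothetical sketch, assuming a single stall delays everything queued behind it:

```python
# Idealised model: a stall of `stall_cycles` in any stage injects that
# many bubble cycles, delaying every instruction behind it.

def pipelined_cycles_with_stalls(stages: int, instructions: int,
                                 stall_cycles: int) -> int:
    return stages + instructions - 1 + stall_cycles

print(pipelined_cycles_with_stalls(4, 4, 0))  # 7: no stalls
print(pipelined_cycles_with_stalls(4, 4, 1))  # 8: one bubble cycle
```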

Now, there are other ways stalls can occur, which are referred to as hazards. First of all, there is your typical hardware fault, but there is also the problem of branching. Instructions are fed into the processor one after the other; in fact, that is how they are fetched from memory. However, suppose one of the instructions that reaches the execute phase tells the computer to jump to a completely different part of memory, fetch whatever instruction is there, and execute that instead. What's happened is that we are at stage three and there are already two other instructions, instructions that are no longer needed, in the pipeline. Once again we have a stall, because these instructions have to be discarded and we start again at square one, and for a period of time the processor sits idle.

There is a way around this, and it is called branch prediction, where the processor tries to predict whether an instruction will branch out to a different part of memory, and acts on that guess before it knows for certain. However, this can be a bit of a double-edged sword, because if the processor gets it wrong, then all of a sudden it is stuck with work it doesn't need and will have to flush the pipeline, creating, yep, a pipeline stall.
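One classic textbook scheme for this is a two-bit saturating counter per branch, which only flips its prediction after two wrong guesses in a row. Here is a minimal sketch in Python; the branch pattern is made up for illustration, and real predictors are far more elaborate:

```python
# A minimal 2-bit saturating-counter branch predictor.
# States 0-1 predict "not taken"; states 2-3 predict "taken".

class TwoBitPredictor:
    def __init__(self):
        self.state = 1  # start weakly not-taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Nudge the counter toward the actual outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, True, False, True]  # a loop-like branch pattern
correct = 0
for taken in outcomes:
    if p.predict() == taken:
        correct += 1
    p.update(taken)

print(correct, "of", len(outcomes), "predicted correctly")  # 3 of 5
```

The single not-taken outcome only weakens the counter rather than flipping it, so the predictor stays right when the loop branch goes back to being taken.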

Now, before we continue, let's have a chat about speed. First of all, speed is measured in seconds, but not the seconds that we have on our clocks: much, much smaller increments of time, namely milliseconds (thousandths), microseconds (millionths), and nanoseconds (billionths of a second). Here is a nice chart that sort of explains things:

Now, the number of clock cycles that occur in the period of one second is referred to as the frequency (sort of makes sense) and is measured in Hertz. You might have heard this in reference to radio communications, and that is pretty much the same thing, except we are measuring the number of waves that pass in a second. So:

  • A kilohertz (kHz) is 1 000 cycles per second;
  • A megahertz (MHz) is 1 000 000 cycles per second (a million);
and you guessed it,
  • A gigahertz (GHz) is 1 000 000 000 cycles per second (a billion).
So, take the CPU in this computer, an AMD Phenom II, which has a clock speed of 200 MHz. That means my CPU runs through 200 million cycles per second, and dividing that by four means it can process up to 50 million instructions per second (working on the four-stage instruction cycle). However, this raises the question: how long does it take to process a single instruction? Well, you work it out by inverting the frequency, i.e.:

1/200 000 000 = 0.000 000 005 seconds.

Let us break this down a bit to work it out: 0.000 000 005. So, we have 8 zeros between the decimal point and our five, which makes the answer 5 × 10⁻⁹, which is 5 nanoseconds, and that is comparatively slow considering the age of my desktop. Oh, and multiply that by 4, and it takes 20 nanoseconds to complete a four-stage instruction cycle.

Well, that was fun, so let's do it again for my laptop. It has an AMD A9 (with Radeon graphics) that operates at 3000 MHz. So, we break that down to 3000 million cycles per second, which equates to 3 billion cycles per second; in fact, that should really be written as 3 GHz. So, let's find out how long a cycle is:

1/3 000 000 000 = 0.000 000 000 333,

which translates to 3.33 × 10⁻¹⁰ seconds, or about 0.33 nanoseconds for a single clock cycle (and around 1.33 nanoseconds to complete the four-stage instruction cycle).
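Both period calculations can be checked with a small helper; this is just the inverse-frequency arithmetic from above, expressed in Python:

```python
# Clock period is the inverse of frequency; returned here in nanoseconds.

def period_ns(frequency_hz: float) -> float:
    return 1 / frequency_hz * 1e9

print(period_ns(200e6))      # ≈ 5.0 ns per cycle (the 200 MHz desktop)
print(period_ns(3e9))        # ≈ 0.333 ns per cycle (the 3 GHz laptop)
print(4 * period_ns(200e6))  # ≈ 20 ns for a four-stage instruction cycle
```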

Superpipelining and Such

Well, it looks like things just might get a little more complex. First of all, we have superpipelining, where the processor begins fetching the next instruction before the previous fetch has completed, which once again increases the throughput of the system. The problem, again, is that if you have to flush the pipeline because of an incorrect branch prediction (or no branch prediction at all), then all of that work goes to waste.

We also have the superscalar architecture, which performs two (or more) instructions in parallel, once again increasing the throughput. What this actually tells us is that two processors aren't always the same, even if they are advertised as operating at the same speed. Sure, they emblazon '4 GHz Intel' on the computer package, but that only tells us one thing about the processor. If it doesn't actually have any pipelining, then it might not be any better than the 3 GHz processor that has a superscalar pipeline.

Anyway, here is another diagram to help us understand what I'm gasbagging about:

Mind you, that's all by the by, because these days computers are both superpipelined and superscalar, and are also pretty deeply pipelined; it's just that you can never seem to find these particular details (such as how deep the pipeline actually is) on any of the spec sheets.
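A crude way to model a superscalar pipeline is to let it issue several instructions per cycle. The formula below is an idealised sketch that ignores stalls and dependencies entirely; the issue widths are made-up examples:

```python
import math

# Idealised cycle count for a pipeline that can issue `width`
# instructions per cycle (no stalls, no dependencies).

def superscalar_cycles(stages: int, instructions: int, width: int) -> int:
    return stages + math.ceil(instructions / width) - 1

print(superscalar_cycles(4, 8, 1))  # 11 cycles: plain scalar pipeline
print(superscalar_cycles(4, 8, 2))  # 7 cycles: 2-wide superscalar
```

This is why a 3 GHz superscalar chip can outrun a 4 GHz chip without one: clock speed alone doesn't determine throughput.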

However, there are problems, namely that while they may offer greater speeds and greater throughput, they also chew through quite a lot of power. Then there are the problems with pipeline stalls, especially if the branch prediction is, well, rubbish. In fact, an older CPU might actually turn out to be more efficient than one of these new, beaut, super-pipelined monstrosities.

In fact, it was eventually discovered that a processor with a whopping 31-stage pipeline was only slightly better than its predecessors.

About those Cores

So, how many cores does your CPU have? This desktop has two, the TV box in the lounge room has four, and the laptop has 5. Oh, and I can't forget the mobile phone: that has 8. So, what are these cores? Well, simply put, they are basically CPUs. When we talk about multi-core processors we are talking about multiple CPUs being squeezed onto a single chip. This helps, once again, with throughput, but also with multitasking. A processor core can really only do one thing at a time (unlike humans, who can do multiple things, such as driving a car while listening to the Beatles and drinking a beer), however we are sometimes given the illusion of multiple things happening at once. Multi-core CPUs can change that. Once again, here is a diagram:

And here's another example to take a look at, this time a little less abstract:

That sort of puts it into perspective. Also notice how the CPU has an onboard graphics processor as well. They really know how to make things compact, and this has only improved: if you open up your computer and look at the size of your CPU, you will notice that it is about a sixth, or even less, of the size of this processor. Oh, and we aren't even talking about mobile phones yet.

Now, another thing is that you cannot simply put a multi-core processor into your computer and expect an immediate performance upgrade. The software needs to be configured to take advantage of it. Sure, most of that is done automatically, but you may have to wait a while for your computer to download the Windows updates that allow this to happen.

Oh, and there is also the cache that I should mention (though I will cover it in more detail in the post on memory). Basically, the cache is memory that sits inside the CPU. There are three levels, imaginatively called levels 1, 2, and 3. Level 1 and 2 caches are generally tied to specific cores, while the level 3 cache is shared among the cores. What the cache does is store instructions and data so that the CPU doesn't need to repeatedly return to the RAM to get its next set of instructions. Once again, good prediction is required to know what needs to be stored in the cache.
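The idea of checking progressively larger, slower caches before falling back to RAM can be sketched like this; the latencies and addresses are made-up numbers, just plausible orders of magnitude:

```python
# Illustrative cache-hierarchy lookup with assumed latencies in cycles
# (real numbers vary by CPU; these are only orders of magnitude).

LATENCY = {"L1": 4, "L2": 12, "L3": 40, "RAM": 200}

def access(address, l1, l2, l3):
    """Return (where the data was found, total cost in cycles)."""
    cost = 0
    for name, cache in (("L1", l1), ("L2", l2), ("L3", l3)):
        cost += LATENCY[name]  # pay the lookup cost at each level
        if address in cache:
            return name, cost
    return "RAM", cost + LATENCY["RAM"]

# Toy cache contents: each level holds everything the level above holds.
l1, l2, l3 = {0x10}, {0x10, 0x20}, {0x10, 0x20, 0x30}
print(access(0x10, l1, l2, l3))  # ('L1', 4)   - fast hit
print(access(0x30, l1, l2, l3))  # ('L3', 56)  - two misses first
print(access(0x99, l1, l2, l3))  # ('RAM', 256) - misses everywhere
```

The gap between 4 and 256 cycles is the whole reason the cache exists: a miss all the way to RAM costs two orders of magnitude more than an L1 hit.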

Like pipelining, multi-core processors also tend to be pretty power-hungry, but they do have the advantage of increasing performance, albeit also the cost.


Multi-threading

Okay, this is where it gets a little tricky. Multi-threading really comes into its own on superscalar CPUs (remember them?). Anyway, this is where a task is divided up into multiple threads, and these threads are then executed concurrently. Sounds confusing? Well, it is, namely because this is one of those ideas designed squarely to increase performance. The other thing is that it enables a process to make full use of the CPU, so if a part is sitting idle, it can execute another thread. There is a further advantage in that the threads can actually talk to each other and build on each other's work.

However, and there is always a however when it comes to these things, for multi-threading to work, in the same way as with multi-core processors, the software needs to be written to take advantage of it. And this is where the problem lies: not all software can benefit, so developers need to consider, when designing their programs, whether multi-threading will actually help.
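As a toy illustration of dividing one task into threads, here is a sketch using Python's standard library. Note that Python's global interpreter lock limits real CPU-bound speed-up, so this only shows the shape of the idea, not a genuine performance win:

```python
# Sketch: splitting one task (summing 1..100) into four threads.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """The work each thread performs independently."""
    return sum(chunk)

numbers = list(range(1, 101))
# Divide the work into four chunks of 25 numbers each.
chunks = [numbers[i:i + 25] for i in range(0, 100, 25)]

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total)  # 5050
```

The program only benefits because the task splits cleanly into independent pieces; software whose steps all depend on each other gains nothing, which is exactly the developer's dilemma described above.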

On to Mobile Phones

Well, everybody seems to have one of them these days, and they pretty much function like a miniature computer. There actually is a difference between the architecture inside your computer and the one in your mobile phone, and it has a lot to do with the CPU. Your typical computer uses what is called a CISC, or Complex Instruction Set Computer, while the mobile phone uses a RISC, or Reduced Instruction Set Computer. What does that mean? Well, I'll try to explain.

Now, you know how we have been talking about instruction cycles? Well, CPUs are built to recognise a set of instructions and how to execute them. The difference is that a CISC processor has a much larger library of instructions than the RISC processor does. Basically, a CISC processor can compress multiple operations into a single instruction, while the RISC processor has a limited set of instructions and must pretty much go the long way around to get the same thing done. Language can be a bit like that. For instance, the word for mobile phone in German is 'Handy': where we use two words to describe something, the Germans use only one. That is sort of how the difference between CISC and RISC can be viewed.

Basically, programs for CISC processors tend to be shorter and more succinct, while programs for RISC processors tend to be longer and more convoluted. That is actually one of the reasons why half the apps on your mobile won't work without an internet connection: the program isn't on your phone, it's on a server elsewhere, and the phone only fetches the instructions it needs to execute at that particular time. This isn't really a problem with normal computers, though with pretty much everything moving online, there will come a time when the only program you fire up on your computer is your browser.

Let me try to show it mathematically. The classic performance equation is:

seconds / program = (instructions / program) × (cycles / instruction) × (seconds / cycle)

Now, the seconds per cycle will stay the same, so we can set that aside, and we can treat the program as a single unit, so all we need to consider is the trade-off between the remaining two terms. A RISC program has more instructions, but each takes fewer cycles; a CISC program has fewer instructions, but each takes more cycles. As such, the result for a CISC processor can actually come out smaller than the result for the RISC processor. So, what is happening is that RISC accepts a higher instruction count in exchange for a less complex processor, and that cost can be recovered through pipelining.
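To make the trade-off concrete, here is a worked example of the performance equation (seconds = instructions × cycles per instruction × seconds per cycle) with entirely made-up instruction counts and CPI values, on the same assumed 3 GHz clock:

```python
# Worked example of the performance equation with illustrative numbers:
# a "CISC-like" program (few, slow instructions) vs a "RISC-like" one
# (many, fast instructions) doing the same job.

def run_time(instructions: int, cycles_per_instruction: float,
             seconds_per_cycle: float) -> float:
    return instructions * cycles_per_instruction * seconds_per_cycle

CLOCK = 1 / 3e9  # seconds per cycle at an assumed 3 GHz

cisc = run_time(1_000_000, 4.0, CLOCK)  # fewer instructions, higher CPI
risc = run_time(4_000_000, 1.2, CLOCK)  # more instructions, lower CPI

print(f"CISC: {cisc * 1e3:.2f} ms, RISC: {risc * 1e3:.2f} ms")
```

With these particular numbers the CISC result comes out smaller, as the text suggests; pipelining is what pushes the RISC machine's cycles per instruction down toward 1 and closes the gap.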

Now, the advantage of the CISC processor is that complex instructions are implemented in the hardware, which means there is less work for the programmer to do. As such, there is greater support in the CISC processor for high-level languages (that is, human-readable code, as opposed to low-level languages, which are much, much closer to the 0s and 1s). With a RISC processor, you need more instructions to perform the same task, but this means the CPU has more space for general-purpose hardware, as opposed to all the space taken up by a complex instruction set. With pipelining available, the speeds can actually end up quite similar.

Take this example of multiplying two numbers together:

On a CISC processor it can be a single instruction:

MULT A, B

while a RISC processor spells the same work out step by step:

LD R1, A
LD R2, B
MUL R3, R1, R2

(Basically, we are loading A and B into registers 1 and 2, multiplying them, and then storing the result in register 3, though with pipelining these steps can overlap with neighbouring instructions.)

RISC processors are smaller and more energy-efficient. Once again, if you look inside your computer you will see a massive thing sitting on top of your CPU: the heatsink and fan, designed to keep the CPU cool. The problem with smartphones is that you can't fit these into the device, so a power-hungry processor is simply going to result in a device that will not work. The design also means that you can combine the entire chipset into a single chip (known as an SoC, or system on a chip). If you look at the specs for, say, my phone (an HTC One M9), you will note that the CPU is listed as an octa-core processor, while the chipset is a Qualcomm Snapdragon. The reason is that the chipset contains a lot more hardware than a conventional CPU does.

The Graphics Card

Remember how I mentioned the graphics processor that was onboard the CPU? Well, it turns out that graphics cards also have their own processor. Look at the one below:

See the couple of fans on it? Well, these days GPUs (graphics processing units) are much more power-hungry than they were back in my university days. Actually, I was going to say that I don't have a graphics card in my computer, until I realised that the monitor is actually plugged into one, and the system specs say that it is a GeForce g98:

Yeah, it's pretty old. Anyway, the major difference is that CPUs are designed to perform a wide range of tasks, whereas GPUs are generally designed to perform the same task over and over again. As such, you will find that a lot of the fancy aspects of the modern CPU have been tossed out simply to add additional cores. The thing is that graphics processing isn't all that complex; it is just performing the same operation over and over again, which is why having the CPU do it is a bit of a waste. Oh, and it is also the reason why bitcoin miners like to use graphics cards for their work (though there are much better ways to mine bitcoin these days).
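The 'same task over and over' point is really about data parallelism: one small kernel applied independently to every pixel. A toy sketch of the idea (a real GPU would run thousands of these pixel operations in parallel; here we just express the per-pixel kernel and map it over a tiny made-up image):

```python
# The "kernel": one simple operation, identical for every pixel.
# Because no pixel depends on any other, every application of it
# could run on its own core simultaneously.

def brighten(pixel, amount=40):
    r, g, b = pixel
    return (min(255, r + amount),
            min(255, g + amount),
            min(255, b + amount))

image = [(100, 150, 200), (0, 0, 0), (250, 250, 250)]
print([brighten(p) for p in image])
# [(140, 190, 240), (40, 40, 40), (255, 255, 255)]
```

Each pixel is independent, so the work scales with the number of cores you can throw at it: exactly the trade the GPU makes by swapping fancy CPU machinery for many simple cores.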

Creative Commons License

Processing the Pipeline by David Alfred Sarkies is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This license only applies to the text and any image that is within the public domain. Any images or videos that are the subject of copyright are not covered by this license. Use of these images is for illustrative purposes only and is not intended to assert ownership. If you wish to use this work commercially, please feel free to contact me.
