Wednesday, 20 February 2019

Es ist kaputt - Equipment Failure

Let me begin by telling you a couple of stories. The first involved a hard drive that I was using while I was out and about in South-East Asia. One day I plugged it into my computer and all of a sudden the computer refused to read it. I was rather frustrated because it meant that I had pretty much lost a whole heap of stuff. Fortunately for me there was a backup, so the amount of stuff that I had lost wasn't all that much (I was actually carrying two hard drives on me, just in case of that eventuality). Anyway, it turned out that its failure was purely my fault. I hadn't been treating it kindly, and I also had the habit of pulling it out without bothering to unmount it. It turned out that the head had buckled and basically that was that - everything was gone.

Oh, and before I continue, just a quick video explaining why you should always unmount your devices before pulling them out:


Also on that same trip, when I was in Phuket, I purchased a hard drive from a computer shop in a shopping center (never buy hardware from one of the roadside stalls, you can be guaranteed that it will not work - a guy I was talking to in Hong Kong did just that with a flash drive, and when he discovered that it wasn't working he opened it up to discover that there wasn't actually any interior). Anyway, I attempted to format this hard drive for Linux, and also to encrypt it, and everything I did caused the hard drive to simply cease to exist. Honestly, I still don't know what the problem was, but my Dad eventually got it working. The third hard drive, which I bought in Bangkok, is still working now.

The final story involved a 3 TB hard drive that I purchased in Melbourne, onto which I basically copied all my movies. Anyway, after two weeks that hard drive also went kaput. Fortunately my Dad was around and he managed to rescue pretty much everything, but he proceeded to tell me that the hard drive had bitten the dust and was basically useless. I ended up having to purchase myself a new one.

Anyway, as you can tell, equipment will eventually break down, and there are three types of equipment failure: wear-out failure, random failure, and infant mortality failure. Basically the chance of wear-out failure increases as time goes on, but it is actually possible to work out the chances of when such an item will fail (which I will get to) - you could say this is what happened in my first example. The second one, as you can guess, pretty much happens at any time, and the causes are, well, completely random. As it turns out, that is what happened in my third example. The reason it isn't infant mortality failure is because that is when the thing fails pretty much the moment you open the package - you could say that this is what happened in the second example (or at least in the experience of my friend in Hong Kong, though I suspect that has more to do with him being ripped off as opposed to any inherent fault with the device, but then again, not having any insides could be considered a form of failure).

Anyway, you can graph these types of failures, and when you combine them they produce what is called a 'bathtub curve' namely because it is shaped like a bathtub:


Honestly, there is probably little one can do with regard to random failure. Sure, you can purchase extended warranties, and in fact a lot of devices come with warranties as it is, and if a device does fail within the warranty period, then I would certainly recommend calling on the warranty. For wear-out failure, well, that is always going to happen - the second law of thermodynamics sort of attests to that. However, one way of dealing with it is by making sure that wear-out failure only occurs around the time when we are basically going to be throwing the thing out anyway due to obsolescence. As for infant mortality, well, manufacturers now run stress tests on their devices to make sure that they can detect such failures, and toss out the offending devices before they actually reach the market.

Now, like pretty much everything where electronic components are concerned, there is another value we need to take into consideration, and that is the Mean Time Between Failures (or MTBF). Now, this is an arithmetic mean, which means it only holds in general - there will be devices that last longer and others that don't, and it certainly doesn't mean that this is how long your device will last until it goes kaput (and it certainly won't go kaput the second its usage ticks over the MTBF). Anyway, say a hard drive has an MTBF of 57 years (or 500,000 hours) - roughly speaking, of a batch of those drives, half will last longer and half will last shorter. Put another way, if you have 500 of those drives running, dividing 500,000 by 500 gives you 1000 hours, which is about 41 days, which means that out of those 500 drives you can expect one to bite the dust roughly every 41 days.

This is a bit of an extreme example
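
Just to make that arithmetic concrete, here is the same back-of-the-envelope sum as a few lines of Python (the 500,000 hour MTBF and the fleet of 500 drives are simply the figures from the example above, not anything pulled from a real data sheet):

# Back-of-the-envelope MTBF arithmetic for a fleet of identical drives.
def expected_failure_interval(mtbf_hours, fleet_size):
    # On average, one failure somewhere in the fleet every mtbf/fleet hours.
    return mtbf_hours / fleet_size

mtbf_hours = 500_000   # the 57-year MTBF from the example above
fleet_size = 500       # number of drives in service

hours = expected_failure_interval(mtbf_hours, fleet_size)
print(f"Expect one failure roughly every {hours:.0f} hours "
      f"(about {hours / 24:.1f} days)")
# prints: Expect one failure roughly every 1000 hours (about 41.7 days)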

Let us have a look at some of the components that make up a computer and see how failure can be an issue:

Capacitor: Now, capacitors are one of the five basic electronic components (transistor, resistor, inductor, and diode are the others) and they basically store electricity, albeit for a short period of time. They are generally used to smooth out electrical flow, or to induce delays, though they also make up the computer's RAM. In the older days (that is pre-2000s) capacitors had the tendency to leak, corrode, or even burst, and that could cause problems in your system. However, these days they are much more reliable, and generally are able to withstand a lot more than they used to.


Cooling Fan: This has moving parts, so you can be assured there is always going to be a chance that it will bust. Once again, quality does tend to improve with price, though fortunately if your cooling fan fails the computer will probably shut down before any permanent damage is done to your system. However, we do need to make sure that it is configured properly, because there is the issue, particularly with tower cases where the motherboard is sitting vertically, that if the fan, or the heatsink, isn't secured properly it could become loose, or even fall off. Also, having enough space in the case to allow good airflow helps.

Power Supply: I had a power supply fail on me once, and I was forced to actually fix it myself. Fortunately I had the laptop to assist me, which meant that not only did I still have a working computer, but I could also look up the solution on the internet. Anyway, power supplies tend to last between 5 and 10 years before giving up the ghost, but they can also be affected by things like surges, lightning strikes, brownouts, and dust. Basically, make sure that your equipment isn't maxing out the power supply, and certainly don't buy junk. Keeping it free from dust also helps.

Hard Drives: If there is one thing that I would recommend, it is to always unmount them before unplugging them from the machine. Hard drives, being mechanical, are always going to be subject to wear and tear, but there is also the chance of a head crash, which can completely destroy a drive and everything on it. I would recommend not treating them roughly, and making sure the head is parked before moving them.

Optical Drives: Honestly, the same goes for these as it does for hard drives - to an extent. Being mechanical, and having moving parts, they are going to wear out. Also, be careful that the laser doesn't get dirty, because if it does then, well, it isn't going to work. The opening mechanism could also fail, meaning you are stuck with a dud device. However, these really aren't in use much anymore, particularly with streaming and digital download services such as Spotify, Netflix, and Steam. Optical disks have basically gone the way of the sextant.

One way of dealing with problems with drives, particularly hard drives, is what is termed SMART technology.

The SMART Hard Drive

SMART stands for Self-Monitoring, Analysis and Reporting Technology (how long did they take to come up with that one, I wonder?) and is basically designed to work out if there is something wrong with the hard drive and find a work-around for it. For instance, if there are bad sectors on the drive, SMART will remember where those bad sectors are and basically avoid them. SMART technology also works with the operating system to inform you of any problems. So, let us take a look at some of the statistics it takes into account when determining whether a hard drive is functional or not (there is also a little script after the list if you want to pull these numbers off your own drive):

Spin-Up Time: This is basically the time the hard drive takes to go from a stationary state to fully spinning. Obviously, if it is taking longer, then the drive is starting to wear.

Bad Sector Count: The number of bad sectors that are on the drive. Obviously, the higher the number, the worse the drive is. The more bad sectors, the less space there is on the drive for you to be able to store all of those pictures (yes, you know the ones I'm talking about).

Power On Hours: This is basically the total number of hours that the hard drive has been operational. Probably something that you should be paying attention to considering what we have been speaking about previously.

Power Cycle Count: This is the total number of times that the drive has been turned on, and turned off again.

Spin Retry Count: If the initial spin fails, then this counts the number of retries the disk has performed to get up to full speed. Obviously, if the drive is starting to fail in this regard, then maybe it is time to start looking for a new drive.

Seek Performance Time: This is basically the time it takes for the drive to perform seek operations, namely to find that saucy picture you have hidden away in your subdirectories. If this value is increasing, it may be a sign of mechanical wear.
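
If you want to peek at these numbers on your own drive, the usual tool on Linux is smartctl from the smartmontools package. The little sketch below just shells out to it and picks out the sort of attributes we've been talking about - it assumes smartctl is installed, that the drive lives at /dev/sda, and that it is run with enough privileges; also, the attribute names vary a little between drive vendors, so treat the list as indicative rather than definitive.

import subprocess

# Attributes roughly corresponding to the values discussed above
# (vendor naming varies, e.g. the bad sector count usually shows up
# as Reallocated_Sector_Ct).
WATCHED = {"Spin_Up_Time", "Reallocated_Sector_Ct", "Power_On_Hours",
           "Power_Cycle_Count", "Spin_Retry_Count", "Seek_Time_Performance"}

# smartctl -A prints the drive's SMART attribute table.
result = subprocess.run(["smartctl", "-A", "/dev/sda"],
                        capture_output=True, text=True)

for line in result.stdout.splitlines():
    fields = line.split()
    # Attribute rows look like: ID NAME FLAG VALUE WORST THRESH ... RAW_VALUE
    if len(fields) >= 10 and fields[1] in WATCHED:
        print(f"{fields[1]}: raw value {' '.join(fields[9:])}")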

Going on a RAID

RAID stands for 'Redundant Array of Independent Disks' and is basically a bunch of hard drives connected to each other so that they function as a single disk. These are used in a lot of server systems, namely because the average consumer simply isn't going to have so much data that they will need a RAID configuration - unless, of course, you happen to be running a successful YouTube channel, such as this guy:

 

Yeah, basically he's showing us how he turned a USB hub into a RAID system using flash drives. Honestly, considering what we said about flash drives in a previous post, I wouldn't be using this to store any sensitive data, but it does explain how RAID works.

Anyway, the thing with RAID is that it is not a system used to back up your files. Okay, some configurations do duplicate your data, but that has more to do with data recovery in the case of failure than any form of backup protection. Honestly, you really should be looking at alternate ways to back up your data, and keeping the backup off site is also quite important. The other thing with RAID is that it can increase hard drive performance, which means that two 1 TB hard drives in a RAID configuration are going to perform better than a single 2 TB hard drive.

Now, when data is saved, it is distributed evenly across the drives, but there are a few configurations for this as well. Since it is being distributed in this way, it means that if you have two drives in the configuration and one of the drives fails, then, well, bye bye data. This is why RAID configurations generally use multiple drives, and when I say multiple, I generally mean more than three or four. Mind you, if you are using a RAID 0 configuration, it doesn't matter how many drives you have - if you lose one, it's bye bye data.

Anyway, RAID 0, otherwise known as 'Striping', means that the data is spread evenly across the drives in stripes. This does not help in the case of hard drive failure, but it does increase the performance of the drive. It is also quite easy to implement, but due to the chance of failure, it shouldn't be used for mission-critical data. The diagram below should give you an idea of how this works:


RAID 1 is called 'mirroring' and basically everything on one drive is mirrored on the other drive. The performance isn't any better than simply having one drive, but if one of the drives fails then you basically have a copy of the data on the second drive, and by replacing the failed drive you can rescue the data. This can actually be combined with RAID 0, as such:

This configuration is known as RAID 1+0 or 0+1.
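
If code makes more sense to you than diagrams, here is a toy illustration of striping and mirroring - just Python lists standing in for drives, and nothing to do with how a real RAID controller is actually implemented:

# Toy RAID 0: chunks of data are dealt out across the "drives" in turn,
# like dealing cards. Lose one drive and every Nth chunk is gone.
def stripe(data, drives, chunk_size=4):
    stripes = [[] for _ in range(drives)]
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for i, block in enumerate(chunks):
        stripes[i % drives].append(block)
    return stripes

# Toy RAID 1: every "drive" simply holds an identical copy of the data.
def mirror(data, drives):
    return [data for _ in range(drives)]

print(stripe(b"ABCDEFGHIJKLMNOP", drives=2))
# [[b'ABCD', b'IJKL'], [b'EFGH', b'MNOP']]
print(mirror(b"ABCDEFGHIJKLMNOP", drives=2))
# [b'ABCDEFGHIJKLMNOP', b'ABCDEFGHIJKLMNOP']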

Now, RAID 5 is much more complicated, but it actually provides the best of both worlds. Basically it stripes like RAID 0, but it also has parity blocks interspersed across the drives to provide redundancy. Basically, if one of the drives fails, then the data from the failed drive can be restored using the parity data. However, for this to work you need at least three drives, though you can go for anywhere up to 16. The other problem is that it does tend to be expensive, and complicated.
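
The parity trick itself is just the XOR operation, and a quick sketch shows why a single lost block can be rebuilt from whatever survives (again, a toy - in a real RAID 5 array the parity blocks are rotated across all the drives rather than sitting on one dedicated disk, but the arithmetic is the same):

# Toy RAID 5 parity: the parity block is the XOR of the data blocks,
# so any one missing block can be rebuilt by XOR-ing everything that's left.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"   # data blocks on three drives
parity = xor_blocks([d1, d2, d3])        # the parity block

# Say the drive holding d2 dies: rebuild it from the survivors plus parity.
rebuilt = xor_blocks([d1, d3, parity])
assert rebuilt == d2
print("rebuilt block:", rebuilt)         # rebuilt block: b'BBBB'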


We also have RAID 6, which is also referred to as 'double parity'. Basically the difference is that the parity is doubled, so instead of there being one parity block per stripe there are two, which means the array can survive two drives failing at once. The trade-off is that reconstruction time in the case of a failure is increased, and if a third drive fails while the others are being reconstructed, then the data is gone. This is why RAID is no substitute for a secure backup.

Now, let's work out how long it will take to reconstruct a disk. So, we have 20 drives, consumer grade, holding 500 GB each, in a mirroring configuration, and the write speed is 90 MB/s - how long will it take to rebuild the failed drive?

Well, each of the drives holds 500 GB, which is 500,000 MB, and at 90 MB/s that comes to about 5,556 seconds, or roughly 92 minutes - call it an hour and a half. If a second drive dies, well, since it is in a mirroring configuration, as long as it is not the drive from which the data is being reconstructed, you are pretty much in the clear. Oh, and since you also have to read the contents of the surviving drive, you will need to multiply that by two, which gives you about three hours.
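
And here are the same sums as code, using the figures from the example (the doubling for the read pass is simply the assumption made above, not a law of nature):

# Rebuild-time arithmetic for the mirrored 500 GB drive example above.
drive_size_gb = 500
write_speed_mb_per_s = 90

size_mb = drive_size_gb * 1000                   # 500,000 MB
write_seconds = size_mb / write_speed_mb_per_s
print(f"Write pass: about {write_seconds:.0f} s, "
      f"or {write_seconds / 60:.1f} minutes")    # ~5556 s, ~92.6 minutes

# Doubling to account for reading the surviving drive, as assumed above.
total_hours = (write_seconds * 2) / 3600
print(f"Read plus write: about {total_hours:.1f} hours")  # ~3.1 hours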

Redundancy also works for power supplies, and you generally see this in computers that basically remain on, for, well, forever. This means that if we need to replace the power supply, turning it off is not an option, and in fact we really don't want the computer to shut down if the power supply fails. Okay, maybe if the computer is the server that houses Facebook, then it certainly won't be the end of the world, but if it is the servers that control the flight computers at Heathrow Airport then that is a different story.

Anyway, the way that works is that you have two power supplies that pretty much do the same thing, namely supply power. However, if one fails then the other can pretty much do the job all on its own. As such, you are then able to remove the one that has failed and then replace it with one that works. It's as simple as that.

Creative Commons License

Es ist kaputt - Equipment Failure by David Alfred Sarkies is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This license only applies to the text and any image that is within the public domain. Any images or videos that are the subject of copyright are not covered by this license. Use of these images is for illustrative purposes only and is not intended to assert ownership. If you wish to use this work commercially please feel free to contact me.
