
A local company has asked for your consultation in order to improve its network's fault tolerance

ID: 3662749 • Letter: A

Question

A local company has asked for your consultation in order to improve its network's
fault tolerance. The current network carries critical administrative, management, financial
and accounting data in real time from both a mainframe host and several servers to
workstations in the various offices and workshops. All data transferred is highly
confidential and must not be lost or accessed by unauthorised personnel.
After an initial visit you have found out that the network is configured as follows:
Eight hundred workstations are connected to five shared servers running Linux.
Fifty of these workstations serve as training computers for new staff. Two hundred
workstations sit in the central administration offices and are used to view and
update various data. Twenty workstations are used in the R&D offices. The
remaining workstations are used in the various offices found in the building.
The clients are connected in a mostly switched, star-wired bus network using
Ethernet 100Base-T. In the few instances where switches are not used, hubs serve
smaller workgroups.
An internet gateway supports e-mail, online activities and VPN communications
with three remote offices. The Internet connection is an 80Mb/s link to a local ISP.
A firewall prevents unauthorised access from the internet connection into the
company's network.
The IT manager has asked you to identify the critical points of failure in this network and
to suggest how these might be eliminated. As part of your report you should draw a logical
diagram of the network and identify single points of failure, then recommend which
points of failure should be addressed to increase availability and how to achieve this goal.

Explanation / Answer

Ans:

Fault tolerance is the property that enables a system to continue operating properly when one or more of its components fail. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system in which even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability or life-critical systems. The ability to maintain functionality when portions of a system break down is referred to as graceful degradation.

A fault-tolerant design enables a system to continue its intended operation, possibly at a reduced level, rather than failing completely, when some part of the system fails.[1] The term is most commonly used to describe computer systems designed to remain more or less fully operational, perhaps with a reduction in throughput or an increase in response time, in the event of some partial failure. That is, the system as a whole is not stopped due to problems either in the hardware or the software. An example in another field is a motor vehicle designed so that it remains drivable if one of the tires is punctured. Another is a structure that retains its integrity in the presence of damage due to causes such as fatigue, corrosion, manufacturing flaws, or impact.

Within the scope of an individual system, fault tolerance can be achieved by anticipating exceptional conditions and building the system to cope with them, and, in general, aiming for self-stabilization so that the system converges towards an error-free state. However, if the consequences of a system failure are catastrophic, or the cost of making it sufficiently reliable is very high, a better solution may be to use some form of duplication. In any case, if the consequence of a system failure is so catastrophic, the system must be able to use reversion to fall back to a safe mode. This is similar to roll-back recovery but can be a human action if humans are present in the loop.

Addressing the Above-Defined Network System: Terminology (Common to All Departments):-

A highly fault-tolerant system might continue at the same level of performance even though one or more components have failed. For example, a building with a backup electrical generator will provide the same voltage to wall outlets even if the grid power fails.

A system that is designed to fail safe, or fail-secure, or fail gracefully, whether it functions at a reduced level or fails completely, does so in a way that protects people, property, or data from injury, damage, intrusion, or disclosure. In computers, a program might fail-safe by executing a graceful exit (as opposed to an uncontrolled crash) in order to prevent data corruption after experiencing an error. A similar distinction is made between "failing well" and "failing badly".

Fail-deadly is the opposite strategy, which can be used in weapon systems that are designed to kill or injure targets even if part of the system is damaged or destroyed.

A system that is designed to experience graceful degradation, or to fail soft (used in computing, similar to "fail safe") operates at a reduced level of performance after some component failures. For example, a building may operate lighting at reduced levels and elevators at reduced speeds if grid power fails, rather than either trapping people in the dark completely or continuing to operate at full power. In computing an example of graceful degradation is that if insufficient network bandwidth is available to stream an online video, a lower-resolution version might be streamed in place of the high-resolution version. Progressive enhancement is an example in computing, where web pages are available in a basic functional format for older, small-screen, or limited-capability web browsers, but in an enhanced version for browsers capable of handling additional technologies or that have a larger display available.

In fault-tolerant computer systems, programs that are considered robust are designed to continue operation despite an error, exception, or invalid input, instead of crashing completely. Software brittleness is the opposite of robustness. Resilient networks continue to transmit data despite the failure of some links or nodes; resilient buildings and infrastructure are likewise expected to prevent complete failure in situations like earthquakes, floods, or collisions.
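To relate the idea of a resilient network to the task in the question, the sketch below flags devices whose failure would cut a client off from every server. It is only an illustration: the device names and links are hypothetical and are not taken from the company's actual diagram, and the function name `single_points_of_failure` is an assumption made for this example.

```python
from collections import defaultdict

def single_points_of_failure(links, clients, services):
    """Return intermediate devices whose loss isolates a client from all services."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    def reachable(start, removed):
        # Depth-first search that pretends the `removed` device has failed.
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in graph[node]:
                if nxt != removed and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    spofs = []
    for device in sorted(graph):
        if device in clients or device in services:
            continue  # only switches, hubs, firewalls, etc. are tested here
        if any(not (reachable(c, device) & services) for c in clients):
            spofs.append(device)
    return spofs

# Hypothetical logical topology, loosely modelled on the question's description.
links = [
    ("admin_ws", "access_switch"),
    ("access_switch", "core_switch"),
    ("core_switch", "server1"),
    ("core_switch", "server2"),
    ("core_switch", "firewall"),
    ("firewall", "gateway"),
]
clients = {"admin_ws"}
services = {"server1", "server2"}
print(single_points_of_failure(links, clients, services))

# Adding a second (hypothetical) core switch removes one single point of failure.
redundant = links + [
    ("access_switch", "core_switch_b"),
    ("core_switch_b", "server1"),
    ("core_switch_b", "server2"),
]
print(single_points_of_failure(redundant, clients, services))
```

In this toy topology the access switch remains a single point of failure even after the core is duplicated, which is the kind of finding the report would then weigh against cost.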

A system with high failure transparency will alert users that a component failure has occurred, even if it continues to operate with full performance, so that failure can be repaired or imminent complete failure anticipated. Likewise, a fail-fast component is designed to report at the first point of failure, rather than allow downstream components to fail and generate reports then. This allows easier diagnosis of the underlying problem, and may prevent improper operation in a broken state.

Criteria:-

Providing fault-tolerant design for every component is normally not an option. Associated redundancy brings a number of penalties: increase in weight, size, power consumption, cost, as well as time to design, verify, and test. Therefore, each component has to be examined to determine whether it should be fault tolerant, weighing how critical the component is, how likely it is to fail, and how expensive it is to make it fault tolerant.

Requirements:-

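The basic characteristics of fault tolerance require: no single point of failure, so that the system keeps operating without interruption while a failed component is repaired; fault isolation to the failing component; fault containment to prevent the failure from propagating; and the availability of reversion modes.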

In addition, fault-tolerant systems are characterized in terms of both planned service outages and unplanned service outages. These are usually measured at the application level and not just at a hardware level. The figure of merit is called availability and is expressed as a percentage. For example, a five nines system would statistically provide 99.999% availability.
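As a rough illustration of what an availability percentage means in practice, the following sketch converts it into allowed downtime per year. The function name and the 365-day year are assumptions made for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability_percent):
    """Minutes per year a system may be down at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for percent in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(percent)
    print(f"{percent}% availability allows about {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, a five nines system may be down for only about 5.26 minutes in a year.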

Fault-tolerant systems are typically based on the concept of redundancy.
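A minimal sketch of why redundancy raises availability, assuming the redundant components fail independently and that the service stays up as long as at least one copy works. Real components often share failure causes (power, cabling, software bugs), so the result is an optimistic upper bound. The function name is hypothetical.

```python
def parallel_availability(single, copies):
    """Availability of `copies` redundant components, each with availability `single`.

    Assumes independent failures and that one working copy is enough.
    """
    return 1 - (1 - single) ** copies

# One 99%-available server versus two or three redundant ones.
for n in (1, 2, 3):
    print(n, "copies:", round(parallel_availability(0.99, n), 6))
```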

Replication:-

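Spare components address the first fundamental characteristic of fault tolerance (no single point of failure) in three ways: replication, in which multiple identical instances run in parallel and the correct result is chosen by a quorum; redundancy, in which identical instances stand by and the system fails over to one of them; and diversity, in which multiple different implementations of the same specification are used so that they are unlikely to share the same fault.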

All implementations of RAID, redundant array of independent disks, except RAID 0, are examples of a fault-tolerant storage device that uses data redundancy.

A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replications of each element should be in the same state. The same inputs are provided to each replication, and the same outputs are expected. The outputs of the replications are compared using a voting circuit. A machine with two replications of each element is termed dual modular redundant (DMR). The voting circuit can then only detect a mismatch and recovery relies on other methods. A machine with three replications of each element is termed triple modular redundant (TMR). The voting circuit can determine which replication is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result, and discard the erroneous version. After this, the internal state of the erroneous replication is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replications.
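The voting logic described above can be sketched in software as follows. Real lockstep machines implement the voter in hardware; the function `vote` and its return format are assumptions made for this illustration.

```python
from collections import Counter

def vote(outputs):
    """Majority-vote over replica outputs.

    Returns (agreed_value, indices_of_disagreeing_replicas).
    Raises ValueError when no strict majority exists, which is the DMR case
    of a detected but uncorrectable mismatch.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise ValueError("mismatch detected, but no majority to correct it")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty

# TMR: a two-to-one vote identifies replica 2 as the one in error.
print(vote([7, 7, 9]))

# DMR: a mismatch can only be detected, not corrected.
try:
    vote([7, 9])
except ValueError as err:
    print("DMR:", err)
```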

Lockstep fault-tolerant machines are most easily made fully synchronous, with each gate of each replication making the same state transition on the same edge of the clock, and the clocks to the replications being exactly in phase. However, it is possible to build lockstep systems without this requirement.

Bringing the replications into synchrony requires making their internal stored states the same. They can be started from a fixed initial state, such as the reset state. Alternatively, the internal state of one replica can be copied to another replica.

One variant of DMR is pair-and-spare. Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating that there is an error. Another pair operates exactly the same way. A final circuit selects the output of the pair that does not proclaim that it is in error. Pair-and-spare requires four replicas rather than the three of TMR, but has been used commercially.
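A small sketch of the pair-and-spare selection described above, again in software for illustration only. Each pair reports an error flag when its two replicas disagree, and the selector takes the output of the first pair that does not report an error. The function names are hypothetical.

```python
def pair_output(a, b):
    """Return (value, error_flag) for one lockstep pair of replica outputs."""
    if a == b:
        return a, False
    return None, True

def pair_and_spare(primary, spare):
    """Select the output of the first pair that does not report an error."""
    for a, b in (primary, spare):
        value, error = pair_output(a, b)
        if not error:
            return value
    raise RuntimeError("both pairs report an error")

print(pair_and_spare((5, 5), (5, 5)))  # primary pair agrees, its output is used
print(pair_and_spare((5, 6), (5, 5)))  # primary pair disagrees, spare pair is used
```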

To add some level of fault tolerance to the network architecture defined above, look for a storage solution that lets the system keep running properly, whatever the operating system, when a component fails, instead of failing completely. Fault tolerance in software or storage solutions usually relies on mirroring. Mirroring means that every operation is performed on more than one system or disk, so that in the event of a failure no information is lost and the user can continue working on the surviving copy.
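The idea of mirroring can be sketched with a toy in-memory store that writes every value to all healthy replicas and reads from the first healthy one. The class `MirroredStore` and its methods are hypothetical and stand in for real disks or servers.

```python
class MirroredStore:
    """Toy mirrored key-value store: every write goes to all healthy replicas."""

    def __init__(self, copies=2):
        self.replicas = [dict() for _ in range(copies)]
        self.failed = set()

    def write(self, key, value):
        for index, replica in enumerate(self.replicas):
            if index not in self.failed:
                replica[key] = value

    def fail(self, index):
        """Simulate the loss of one replica."""
        self.failed.add(index)

    def read(self, key):
        for index, replica in enumerate(self.replicas):
            if index not in self.failed:
                return replica[key]
        raise IOError("all replicas have failed")

store = MirroredStore(copies=2)
store.write("ledger", 100)
store.fail(0)                # the first replica is lost
print(store.read("ledger"))  # the value is still served from the mirror
```

The sketch ignores resynchronising a replaced replica, which a real mirror must also handle.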

How Does RAID Affect Fault Tolerance?

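RAID storage solutions have different levels. The most commonly used are RAID 0 (striping with no redundancy), RAID 1 (mirroring), RAID 5 (striping with single distributed parity), RAID 6 (striping with dual distributed parity), and RAID 10 (a stripe across mirrored pairs).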

Deciding between a software and a hardware implementation of RAID is equally important. A software RAID solution typically supports fewer of the RAID levels you may need than a hardware RAID controller does.

What RAID Solution Is Best for the Above-Defined Network?

Analyze your company's priorities. If you value fault tolerance more than the speed and performance of your system, RAID 1 or RAID 10 may be the best option. If you are more concerned with performance, RAID 0 or RAID 5 would be a reasonable choice, but note that RAID 0 provides no redundancy at all, so it is unsuitable for data that must not be lost, as is the case here. If you value fault tolerance and system performance equally, spending the extra money on RAID 6 or RAID 10 is the better option, since it keeps your data safe from disk failure while limiting the impact on performance.
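To make the trade-off concrete, the sketch below compares usable capacity and the number of disk failures each common RAID level is guaranteed to survive. It assumes equal-sized disks, a two-way mirror layout for RAID 10, and an n-way mirror for RAID 1; the function name and the example disk counts are hypothetical.

```python
MIN_DISKS = {0: 2, 1: 2, 5: 3, 6: 4, 10: 4}

def raid_summary(level, disks, disk_tb):
    """Return (usable_capacity_tb, guaranteed_disk_failures_tolerated)."""
    if level not in MIN_DISKS:
        raise ValueError(f"unsupported RAID level: {level}")
    if disks < MIN_DISKS[level]:
        raise ValueError(f"RAID {level} needs at least {MIN_DISKS[level]} disks")
    if level == 0:
        return disks * disk_tb, 0        # striping only, no redundancy
    if level == 1:
        return disk_tb, disks - 1        # every disk holds a full copy
    if level == 5:
        return (disks - 1) * disk_tb, 1  # one disk's worth of parity
    if level == 6:
        return (disks - 2) * disk_tb, 2  # two disks' worth of parity
    if disks % 2:
        raise ValueError("RAID 10 needs an even number of disks")
    # RAID 10 can survive more failures if they hit different mirror pairs,
    # but only one failure is guaranteed to be survivable.
    return disks // 2 * disk_tb, 1

for level in (0, 1, 5, 6, 10):
    usable, tolerated = raid_summary(level, disks=4, disk_tb=2)
    print(f"RAID {level}: {usable} TB usable, survives {tolerated} disk failure(s)")
```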
