Chap#11

Understanding Fault Tolerance (Slides 1-3)

Q: What is Fault Tolerant Distributed Computing and why is it important?

A: Fault Tolerant Distributed Computing is the design of distributed systems that continue to function correctly even when some of their components fail. It matters because a fault-tolerant system stays operational, recovers from failures, and keeps its service available with little or no noticeable impact on the user.

Q: What are the common drawbacks or trade-offs of implementing fault tolerance?

A: Implementing fault tolerance can make the system slower, require more disk space, use more machines, and increase overall costs. There's always a trade-off between cost and the degree of fault tolerance.

Failure vs. Error (Slide 4)

Q: What is the difference between a system failure and an error?

Failure: Occurs when the system does not behave as expected (e.g., becomes unreachable or produces incorrect output).

Error: An incorrect state within the system that may lead to a failure. Errors can sometimes be detected and corrected before they cause a failure.

Phases of Fault Tolerance (Slide 5)

Q: What are the three main phases in handling faults for fault tolerance?

Error Detection: Identifying that an error has occurred.

Damage Confinement: Preventing the error from spreading to other parts of the system.

Error Recovery: Removing the error or its effects so the system can continue operating correctly.
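The three phases can be sketched in a few lines. This is only an illustration under assumptions that are not in the slides: a CRC32 checksum stands in for error detection, and a second stored copy stands in for the recovery source. The names store and read_with_recovery are hypothetical.

```python
import zlib


def store(payload):
    """Return the payload together with a checksum computed at write time."""
    return payload, zlib.crc32(payload)


def read_with_recovery(record, backup):
    """Read a record, falling back to a backup copy if it is corrupted."""
    payload, checksum = record
    # Phase 1, error detection: a checksum mismatch reveals an incorrect state.
    if zlib.crc32(payload) == checksum:
        return payload
    # Phase 2, damage confinement: the corrupted payload is never returned,
    # so the error cannot spread to the caller.
    backup_payload, backup_checksum = backup
    if zlib.crc32(backup_payload) != backup_checksum:
        raise IOError("both copies are corrupted")
    # Phase 3, error recovery: continue with the intact backup copy.
    return backup_payload


good = store(b"balance=100")
corrupted = (b"balance=900", good[1])  # payload changed, checksum stale
print(read_with_recovery(corrupted, good))
```

Here the error (a corrupted record) is caught and corrected before it becomes a failure visible to the user.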

Types of Faults (Slides 6-7)

Q: What are the three kinds of processor faults? Briefly describe each.

Fail Stop: The processor completely fails and stops responding. Other processors can usually detect this.

Slowdown: The processor works more slowly than usual, or may stop working completely over time.

Byzantine: The processor behaves arbitrarily. It might stop working, run slowly, or appear normal while silently producing wrong results or actively disrupting the computation.
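One common way for other processors to notice a fail-stop fault is a heartbeat timeout. The sketch below is an assumption for illustration, not something taken from the slides; the class HeartbeatDetector and its timeout parameter are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatDetector:
    """Suspects any node whose last heartbeat is older than `timeout`."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now):
        """Return the nodes that have been silent for longer than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("p1", now=0.0)
detector.heartbeat("p2", now=0.0)
detector.heartbeat("p1", now=4.0)   # p2 stays silent
print(detector.suspected(now=5.0))
```

Note the limits of this approach: a timeout cannot tell a crashed processor from a merely slow one, and a Byzantine processor may keep sending heartbeats while returning wrong results.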

Q: What are network faults? Give two examples.

Network faults occur when processors cannot communicate effectively. Examples include:

One-way Links: Messages travel in only one direction between two processors. One processor can send to the other, but cannot receive messages from it.

Network Partition: A part of the network becomes completely isolated from other parts.

Attributes of a Fault Tolerant System (Slide 8)

Q: What are four important attributes of a fault-tolerant system?

Availability: The system is ready and able to perform its functions at any given moment.

Reliability: The system can operate continuously without failure over a specified period.

Safety: If the system fails, it does so without causing major disasters or negatively impacting other systems.

Maintainability: Failures can be easily detected and repaired.
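Availability and reliability are often put into numbers. The sketch below uses the standard textbook formulas, which are not given in the slides: steady-state availability as MTTF / (MTTF + MTTR), and reliability over a period t as exp(-t / MTTF), the latter assuming a constant failure rate. MTTF (mean time to failure) and MTTR (mean time to repair) are terms introduced here for illustration.

```python
import math


def availability(mttf, mttr):
    """Fraction of time the system is ready to serve (steady state)."""
    return mttf / (mttf + mttr)


def reliability(t, mttf):
    """Probability of running for time t without any failure,
    assuming a constant failure rate (exponential model)."""
    return math.exp(-t / mttf)


# A system that runs 999 hours between failures and takes 1 hour to repair.
print(availability(mttf=999, mttr=1))
print(reliability(t=24, mttf=999))
```

The two attributes differ: a system that fails briefly every hour can be highly available yet unreliable, because it rarely runs for long without a failure.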

Types and Classification of Failure (Slides 9-10)

Q: Name a few types of server failures and what they mean.

Crash failure: The server was working fine but suddenly stops completely.

Omission failure: The server fails to reply to requests, either because it never receives them or because it never sends a response.

Timing failure: The server replies too soon or too late, not within the expected time.

Response failure: The server replies, but the answer is wrong.

Arbitrary failure (Byzantine): The server can do anything wrong at any time, including sending random or misleading responses.

Q: How are failures classified (e.g., by duration)?

A: Failures are classified as:

Transient: Appears once and then disappears on its own.

Intermittent: Appears, disappears, and then reappears repeatedly.

Permanent: Continues to exist until the faulty component is repaired or replaced.
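The duration of a failure suggests how to react to it. A transient failure is often handled by simply trying again, while a permanent one needs repair or replacement. The following is a minimal sketch of retrying with growing delays; TransientError, call_with_retry, and the parameter values are hypothetical names chosen for this example.

```python
import time


class TransientError(Exception):
    """Raised by an operation that may succeed if tried again."""


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, retrying on TransientError with exponential backoff.

    If every attempt fails, the last error is re-raised so the caller
    can treat the fault as permanent.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky():
    """Fails twice, then succeeds: a stand-in for a transient fault."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary glitch")
    return "ok"


print(call_with_retry(flaky, sleep=lambda seconds: None))
```

Retrying does not help with intermittent faults in the long run, since they keep coming back until the faulty component is found and fixed.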

Fault Tolerance Mechanisms (Slides 11-15)

Q: Name three main fault tolerance mechanisms in distributed systems.

Replication-based fault tolerance technique.

Process level redundancy technique.

Fusion-based redundancy technique.

Q: What is the core idea of replication-based fault tolerance?

A: To replicate (copy) data onto other machines or servers. If one machine/server fails, the system can continue using a replica, preventing a total system stop.
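The idea of continuing on a replica can be sketched as a read that fails over from one copy to the next. The Replica class and read_with_failover function below are hypothetical and keep everything in memory; a real system would talk to separate machines.

```python
class ReplicaDown(Exception):
    """Raised when a replica cannot be reached."""


class Replica:
    """An in-memory stand-in for a server holding a copy of the data."""

    def __init__(self, name, data, alive=True):
        self.name = name
        self.data = dict(data)
        self.alive = alive

    def get(self, key):
        if not self.alive:
            raise ReplicaDown(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Try each replica in order and return the first successful read."""
    for replica in replicas:
        try:
            return replica.get(key)
        except ReplicaDown:
            continue  # this copy is unavailable, try the next one
    raise ReplicaDown("all replicas are unavailable")


data = {"user:1": "Ada"}
replicas = [Replica("r1", data, alive=False), Replica("r2", data)]
print(read_with_failover(replicas, "user:1"))
```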

Q: What are two major problems or challenges with data replication?

Consistency: Ensuring all copies of the data remain consistent, especially when clients update data.

Degree of replication: Achieving high fault tolerance might require many replicas, which increases complexity and cost.
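The consistency problem shows up as soon as an update reaches only some of the copies. One widely used remedy, not covered in the slides and shown here only as an assumption-laden sketch, is quorum reads and writes: with N replicas, writing to W and reading from R copies such that W + R > N guarantees that every read overlaps the latest write. The functions quorum_write and quorum_read and the version numbers are hypothetical.

```python
def quorum_write(replicas, key, value, version, w):
    """Store (version, value) on the first w replicas; the rest stay stale."""
    for replica in replicas[:w]:
        replica[key] = (version, value)


def quorum_read(replicas, key, r):
    """Read the last r replicas and return the value with the highest version."""
    answers = [replica[key] for replica in replicas[-r:] if key in replica]
    return max(answers)[1]


# N = 3 replicas, all starting with version 1 of the value.
replicas = [{"x": (1, "old")} for _ in range(3)]
quorum_write(replicas, "x", "new", version=2, w=2)

print(quorum_read(replicas, "x", r=2))  # W + R > N: sees the latest write
print(quorum_read(replicas, "x", r=1))  # W + R = N: may return stale data
```

The sketch also hints at the cost side: larger quorums and more replicas mean more messages per operation.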

MCQs

Q1: What are transient faults?

A) Permanent hardware failures

B) Faults that disappear on their own

C) Faults caused by software bugs

D) Continuous power supply issues

Answer: B

Q2: Which of the following is a technique used to handle transient faults?

A) Memory paging

B) Load balancing

C) Comparing process outputs

D) Data encryption

Answer: C
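Comparing process outputs can be sketched as majority voting over redundant runs of the same computation. This is an illustrative assumption about how the comparison might be done; the function name vote is hypothetical.

```python
from collections import Counter


def vote(outputs):
    """Return the output produced by a strict majority of the processes."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: outputs disagree too much")
    return value


# Three redundant runs; one was hit by a transient fault.
print(vote([42, 42, 41]))
```

With only two processes a mismatch can be detected but not corrected, which is why three or more redundant runs are typically used when the correct result must also be recovered.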

Q3: What does the Checkpoint and Rollback technique do?

A) Speeds up processing

B) Encrypts all data

C) Saves and restores the system state

D) Increases memory usage

Answer: C
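Saving and restoring state can be shown with a small in-memory example. This is a minimal sketch; the class CheckpointedCounter is hypothetical, and a real system would write checkpoints to stable storage rather than keep them in memory.

```python
import copy


class CheckpointedCounter:
    """A tiny stateful process that can save and restore its state."""

    def __init__(self):
        self.state = {"total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def add(self, amount):
        self.state["total"] += amount

    def checkpoint(self):
        """Save a copy of the current state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Restore the most recently saved state."""
        self.state = copy.deepcopy(self._checkpoint)


counter = CheckpointedCounter()
counter.add(10)
counter.checkpoint()      # known-good state: total is 10
counter.add(-9999)        # a faulty update corrupts the state
counter.rollback()        # return to the last checkpoint
print(counter.state["total"])
```

Work done after the last checkpoint is lost on rollback, so the checkpoint interval trades runtime overhead against the amount of work to redo.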

Q4: What issue does the fusion-based technique aim to solve in replication?

A) Data inconsistency

B) Network delay

C) High cost of multiple backups

D) Security vulnerabilities

Answer: C

Q5: What is a major drawback of the fusion-based technique?

A) Poor data accuracy

B) High recovery overhead

C) Increased energy use

D) Slow normal operation

Answer: B
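The trade-off behind Q4 and Q5 can be illustrated with a simple XOR parity block. This is a simplified analogy, not the actual fusion algorithm from the slides: one fused backup protects several primaries instead of one full copy each, but rebuilding a lost primary requires reading every surviving primary. The functions fuse and recover are hypothetical, and all blocks are assumed to have equal length.

```python
def fuse(blocks):
    """XOR equal-length byte blocks into a single fused backup block."""
    fused = bytes(len(blocks[0]))  # all zero bytes
    for block in blocks:
        fused = bytes(x ^ y for x, y in zip(fused, block))
    return fused


def recover(surviving, fused):
    """Rebuild the one missing block from all survivors plus the fused block."""
    return fuse(surviving + [fused])


a, b, c = b"aaaa", b"bbbb", b"cccc"
backup = fuse([a, b, c])          # one backup instead of three replicas

# Primary b is lost; recovery needs both a and c as well as the backup.
print(recover([a, c], backup))
```

Plain replication would recover b by copying a single replica, which is why fusion saves storage at the price of higher recovery overhead.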
