A system and method for preventing cascading failures of clusters in a large-scale distributed system are disclosed. An example method begins by determining the current system conditions, including the state and capacity of each cluster in the system. Given the current system conditions, the maximum number of entities that can be served by the system may be determined. The determined maximum number of entities is then served. In the event of a cluster failure, a determination is made as to whether the entire load from a cluster can be failed over by the system without creating cascading failures. Responsive to determining that the entire load from a cluster cannot be failed over by the system without creating cascading failures, a partial amount of the cluster load is identified to fail over in the event of cluster failure.
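The failover decision described above can be illustrated with a minimal sketch. It assumes a deliberately simple model that the abstract does not specify: each cluster has an integer capacity and current load, and a failed cluster's load may be spread freely across the spare capacity of the remaining healthy clusters. The names `Cluster`, `spare_capacity`, `can_fail_over_entirely`, and `failover_amount` are hypothetical and chosen only for illustration; the disclosed method may determine these quantities differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cluster:
    """Hypothetical record of one cluster's conditions."""
    name: str
    capacity: int          # maximum entities the cluster can serve
    load: int              # entities the cluster currently serves
    healthy: bool = True   # state of the cluster


def spare_capacity(clusters: List[Cluster], failed_name: str) -> int:
    """Total unused capacity on healthy clusters other than the failed one."""
    return sum(
        max(c.capacity - c.load, 0)
        for c in clusters
        if c.healthy and c.name != failed_name
    )


def can_fail_over_entirely(clusters: List[Cluster], failed_name: str) -> bool:
    """True if the failed cluster's whole load fits in the spare capacity."""
    failed = next(c for c in clusters if c.name == failed_name)
    return failed.load <= spare_capacity(clusters, failed_name)


def failover_amount(clusters: List[Cluster], failed_name: str) -> int:
    """Load to fail over: all of it if it fits, otherwise only the part
    the surviving clusters can absorb without exceeding their capacity."""
    failed = next(c for c in clusters if c.name == failed_name)
    return min(failed.load, spare_capacity(clusters, failed_name))


# Assumed example conditions: three clusters of equal capacity.
clusters = [
    Cluster("A", capacity=100, load=80),
    Cluster("B", capacity=100, load=70),
    Cluster("C", capacity=100, load=60),
]

# If A fails, B and C together have 30 + 40 = 70 units of spare capacity,
# so only 70 of A's 80 entities can be failed over under this model.
full = can_fail_over_entirely(clusters, "A")
partial = failover_amount(clusters, "A")
```

Under this model, capping the failed-over load at the surviving clusters' spare capacity is what keeps any of them from being pushed past capacity and failing in turn.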