High Availability and Fault Tolerance in Call Center Environments
Contact centers are mission-critical systems. Customers are first-class citizens of any business, and communication with them is an opportunity for companies to put their best foot forward. Consequently, any issue during that communication may lead to customer dissatisfaction with the service, or even with the business as a whole.
Contact center systems help companies manage high volumes of customer communication, so any interruption or degradation of service in their operation has a multiplied impact.
This is why high availability (minimized service unavailability or downtime) and fault tolerance (continued service even when some parts of the system fail) are of utmost importance in contact center systems design and operation. They are even more important for cloud-based contact center software systems.
The kinds of issues that can impact contact center systems include:
- Internet access and services
- Voice provider issues
- Power outages
- Hardware failures
- Software failures
- Maintenance requirements
- Security issues
- Operator errors
Many of these issues can be addressed in the design and implementation of the contact center's software platform.
Highly Available and Fault-Tolerant Cloud Contact Center Software Platform
The availability design principles followed during the development of Bright Pattern are:
- Elimination of single points of failure
- Load balancing
- Fast fault detection
- Exclusion of failed components
- Switch over and rebalancing
- Real-time configuration updates
- Provisioning for spare capacity
- Introduction of components and servers into and out of service on the fly
- Disaster recovery to another datacenter
- Security hardening
- Friendly user interfaces
Elimination of single points of failure
Bright Pattern comprises software components running on multiple servers. The components communicate with each other, forming a logical cluster.
All components must be provisioned to be present in more than one instance. In addition:
- MongoDB, used for high-performance data storage, is deployed in master/slave replica sets: redundant groups of servers with on-the-fly switchover.
- MySQL, used for configuration storage and reporting, is deployed in master/slave pairs with on-the-fly switchover.
Real-time configuration updates
The components in the system continuously listen for configuration updates. Nothing ever has to be restarted to apply configuration changes.
When a component requires the services of another component, it uses the active component list from the configuration service. To service a request, it selects components from the list in a round-robin fashion.
This ensures that there are no standby components: all components perform work continuously, and there are no surprises when service is switched from one to another.
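The live component list and round-robin selection described above can be sketched as follows. This is an illustrative model only; the class and method names are assumptions, not Bright Pattern APIs.

```python
import itertools
import threading

class ComponentRegistry:
    """Tracks the active instances of a service, as published by the
    configuration service, and hands them out round-robin."""

    def __init__(self, instances):
        self._lock = threading.Lock()
        self._cycle = itertools.cycle(list(instances))

    def apply_config_update(self, instances):
        # A configuration-update notification replaces the active list
        # on the fly -- no component restart is needed.
        with self._lock:
            self._cycle = itertools.cycle(list(instances))

    def next_instance(self):
        # Round-robin selection keeps every active instance doing work,
        # so there are no idle "standby" components.
        with self._lock:
            return next(self._cycle)

registry = ComponentRegistry(["media-1", "media-2"])
picks = [registry.next_instance() for _ in range(4)]
# All instances are exercised in turn; an update takes effect immediately.
registry.apply_config_update(["media-1", "media-2", "media-3"])
```

Because every instance serves live traffic, a switchover sends requests to components that are already warm rather than to cold standbys.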
Fast Fault Detection
There are two mechanisms for fault detection in Bright Pattern:
All components keep a connection to the configuration service components. The connection is used for runtime configuration update notifications as well as for keepalive heartbeats. The configuration services use keepalive information to update each component's operational status in the operations data store and to initiate rebalancing, if applicable.
In addition, many components that work closely together maintain direct connections with their "subcontractor" components. They also use keepalive heartbeats over those connections to detect whether a request for service can be sent there and completed on time.
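The heartbeat-based detection described above amounts to tracking the last keepalive from each peer and declaring a fault when one is overdue. A minimal sketch, in which the timeout value and all names are assumptions:

```python
import time

class HeartbeatMonitor:
    """Marks a peer as failed when no keepalive arrives within the timeout."""

    def __init__(self, timeout_s=3.0, clock=time.monotonic):
        self._timeout = timeout_s
        self._clock = clock          # injectable clock, to make testing easy
        self._last_seen = {}

    def heartbeat(self, component_id):
        # Called whenever a keepalive arrives over the connection.
        self._last_seen[component_id] = self._clock()

    def is_alive(self, component_id):
        last = self._last_seen.get(component_id)
        return last is not None and (self._clock() - last) <= self._timeout

# Simulate time to show the detection logic:
fake_now = [0.0]
mon = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])
mon.heartbeat("router-1")
fake_now[0] = 2.0
alive_at_2s = mon.is_alive("router-1")   # keepalive still fresh
fake_now[0] = 6.0
alive_at_6s = mon.is_alive("router-1")   # keepalive missed; peer presumed down
```

A caller would consult `is_alive` before dispatching a request, which is how a component decides whether a "subcontractor" can complete work on time.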
Switchover and rebalancing
Once a fault is detected, the failed request is resubmitted to another component.
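Resubmission on fault can be sketched as trying each active instance in turn until one accepts the request. The function and the simulated transport below are hypothetical, standing in for the platform's real RPC layer:

```python
def submit_with_failover(request, instances, send):
    """Try each active instance; on failure, resubmit the same request
    to the next one, excluding the failed component."""
    last_error = None
    for instance in instances:
        try:
            return send(instance, request)
        except ConnectionError as err:
            last_error = err          # failed component is skipped, move on
    raise RuntimeError("no operational instance left") from last_error

def flaky_send(instance, request):
    # Simulated transport: the first router is down.
    if instance == "router-1":
        raise ConnectionError(instance)
    return f"{request} handled by {instance}"

result = submit_with_failover("route call 42", ["router-1", "router-2"], flaky_send)
# result == "route call 42 handled by router-2"
```

The caller never sees the intermediate failure; it only observes a slightly longer completion time.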
The keepalives from the configuration services are also used to rebalance components that keep state for specific entities. For example, each interaction router has assigned tenants for which it routes interactions. Once a router is detected to have failed, the configuration service redistributes its tenants to other routers that are deemed operational and informs the rest of the system.
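A tenant redistribution of this kind can be modeled as reassigning the failed router's tenants to the least-loaded surviving routers. This is a sketch under assumed data structures, not the platform's actual algorithm:

```python
def redistribute_tenants(assignments, failed_router, operational):
    """Reassign the failed router's tenants across the operational
    routers, least-loaded first."""
    new = {r: list(assignments.get(r, [])) for r in operational}
    for tenant in assignments.get(failed_router, []):
        target = min(new, key=lambda r: len(new[r]))  # least-loaded router
        new[target].append(tenant)
    return new

assignments = {
    "router-1": ["acme", "globex"],   # router-1 is about to fail
    "router-2": ["initech"],
    "router-3": [],
}
after = redistribute_tenants(assignments, "router-1", ["router-2", "router-3"])
# "acme" goes to the empty router-3; "globex" balances onto router-2.
```

The configuration service would then broadcast the new assignment map so that signaling and media components know where to direct each tenant's interactions.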
Recognizing that a component failure impacts all interactions serviced by that component, and that dropped interactions hurt customer satisfaction, we employ a number of mechanisms to spread interaction state across the system so that an interaction can continue even if the key components servicing it have failed.
Some examples include:
- If a media server is detected to have failed, the corresponding signaling server selects another media server and tells all endpoints in the conversation to start using it. The result is a short period of silence followed by resumption of the conversation.
- Failure of a router results in interactions waiting in the queue being resubmitted to another router. No information is lost; routing of some interactions may be delayed by about 30 seconds.
- If a call scenario component fails, the call is resubmitted to a new one along with the basic state kept in the corresponding signaling component. Calls connected to an agent simply continue, calls in the queue continue waiting, and only calls in the IVR restart the IVR menu from the beginning.
- If an agent server fails, its duties are taken over by another agent server, and agent desktop applications, upon automatic reconnection, are connected to that instance. The impact is a delay of several seconds in agent desktop operations with a "please wait, reconnecting" message displayed.
Provisioning for spare capacity
The system may have many component instances, but the actual load could use up all the provisioned capacity, so that a failure would overload the remaining components.
The Bright Pattern software platform provides SNMP counters that let operations personnel assess the levels of system resource use, and alert when these levels are exceeded and additional capacity must be brought in.
Note that redundant capacity does not mean duplicating all components: we use the N+1 paradigm, which means spare capacity must equal or exceed the capacity that can be lost to a single failure.
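The N+1 arithmetic reduces to a simple check: with one instance lost, the survivors must still carry peak load. The numbers below are illustrative assumptions, not platform sizing guidance:

```python
def n_plus_one_ok(instances, peak_load, per_instance_capacity):
    """N+1 check: the system must carry peak load even after losing
    one instance."""
    surviving = instances - 1
    return surviving * per_instance_capacity >= peak_load

# Example: 5 media servers, each rated for 200 concurrent calls,
# against an 800-call peak. Losing one leaves 4 * 200 = 800 >= 800.
ok_with_five = n_plus_one_ok(instances=5, peak_load=800, per_instance_capacity=200)
# With only 4 servers, losing one leaves 3 * 200 = 600 < 800: overload.
ok_with_four = n_plus_one_ok(instances=4, peak_load=800, per_instance_capacity=200)
```

The SNMP counters mentioned above feed exactly this kind of calculation: they supply the observed peak load that operations staff compare against surviving capacity.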
Additional capacity could also be provisioned to handle component maintenance comfortably, without sacrificing system survivability.
Introduction of components and servers into and out of service on the fly
In 24/7 operation, taking the system down for maintenance or upgrade is almost impossible.
This is why we implemented soft maintenance shutdowns and introductions into service of servers and separate components.
Components or whole servers can be taken out of service, in which case they stop receiving new transactions, wait until existing transactions end, and then shut down.
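The soft shutdown described above is a drain: refuse new work, let in-flight work finish, then stop. A minimal sketch of that state logic, with all names assumed:

```python
class DrainableComponent:
    """Soft shutdown: stop accepting new transactions, finish the ones
    in flight, then report that it is safe to stop."""

    def __init__(self):
        self.in_service = True
        self._active = 0

    def begin_transaction(self):
        if not self.in_service:
            raise RuntimeError("component is draining; route elsewhere")
        self._active += 1

    def end_transaction(self):
        self._active -= 1

    def take_out_of_service(self):
        self.in_service = False      # new work is refused from now on

    def may_shut_down(self):
        # Safe to stop only once every old transaction has completed.
        return not self.in_service and self._active == 0

c = DrainableComponent()
c.begin_transaction()
c.take_out_of_service()
still_busy = c.may_shut_down()      # one transaction still in flight
c.end_transaction()
ready = c.may_shut_down()           # drained; safe to shut down
```

Combined with the round-robin selection over the active component list, draining is invisible to callers: the configuration service simply stops handing out the draining instance.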
New components and whole servers can also be introduced into the system on the fly.
As an example, a server replacement can be carried out by adding a new server and then taking the old server out, on the fly, without impact to operations or even to processing capacity.
This way the system can be moved not only between servers but also between racks or even data centers.
Disaster recovery to another datacenter
The system uses the slave replication capabilities of MongoDB and MySQL to pass up-to-the-minute customer information to disaster recovery (DR) locations in real time. In the event of a disaster at the primary data center, operations personnel switch the backup instances into primary masters and start the servers. Communications are transitioned by re-advertising the IP network over BGP from the new location or by a fast DNS update.
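The switchover itself can be modeled as a role change: the replicating DR site is promoted to primary and traffic is redirected to it. The site records and field names below are illustrative assumptions, not Bright Pattern configuration; the actual promotion uses MongoDB and MySQL replication controls.

```python
def fail_over(sites):
    """DR switchover sketch: mark the primary site down and promote
    the replicating DR site to primary."""
    primary = next(s for s in sites if s["role"] == "primary")
    standby = next(s for s in sites if s["role"] == "dr-replica")
    primary["role"] = "failed"
    standby["role"] = "primary"      # backup instances become masters
    # Traffic then follows via BGP re-advertisement or a fast DNS update
    # pointing at the new primary's addresses (not modeled here).
    return standby["name"]

sites = [
    {"name": "dc-east", "role": "primary"},
    {"name": "dc-west", "role": "dr-replica"},
]
new_primary = fail_over(sites)
```

Because replication runs continuously, the promoted site already holds up-to-the-minute data when it takes over.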
Security hardening
The internet-facing servers are specifically penetration-tested and hardened with each release. For example, the system can work behind SIP session border controllers, but it does not require them.
In any case, using intrusion detection and intrusion prevention systems with any SaaS platform is a good practice.
Friendly User Interfaces
Bright Pattern user interfaces are designed from the point of view of each user role's task flow. The developers design solutions only after thinking through what the user actually has to do.
In addition, Bright Pattern operations personnel use these same user interfaces and provide feedback directly to the development team.