Business Continuity Plan
This business continuity plan analyzes the risks we face and our response to each of them. Threats are grouped by severity.
The disaster recovery plan is also relevant, as it provides a plan of action for some of these scenarios.
SEV-1 Threats
Complete outage of service in the GCP and AWS regions that Avocode operates in
Description
Occasionally, our hosting providers experience outages that adversely affect our services. Even though these providers build their infrastructure with reliability and resiliency in mind, mistakes and failures still happen.
Effect
The Avocode service is not operational. New designs cannot be uploaded or processed and accessing existing designs is not possible.
Mitigation
We chose GCP for its excellent reputation in reliability and its track record in resolving issues quickly.
Additionally, we are planning to migrate some services to other GCP regions to increase our resiliency to region-specific outages.
Path to resolution
We rely on GCP technical staff to restore operations as quickly as they can. If core GCP services (like compute and network) are affected, we don't have the ability to work around those issues.
Disruption of service of Cloudflare
Description
We rely on our CDN partner, Cloudflare, to ensure fast global HTTP access to our backend services. Cloudflare operates a highly redundant infrastructure but occasionally there are global issues that could adversely affect Avocode's availability.
Effect
The Avocode service is not operational. New designs cannot be uploaded or processed and accessing existing designs is not possible.
Mitigation
Our ability to communicate an outage to our customers (via email, Intercom chat and our status page) does not depend on Cloudflare.
Path to resolution
We rely on Cloudflare to restore service in a timely fashion. We currently do not have any plan to migrate away from Cloudflare in the event of an outage.
Kubernetes control plane issues
Description
We use Kubernetes as a container orchestrator for all of our services. Kubernetes is complex and we rely on Google's managed service to handle most of the maintenance and operations work of keeping the clusters up and running.
Occasionally, issues occur that prevent deployments from scaling or from communicating correctly on the network. These can range anywhere from SEV-1 to SEV-5, depending on the issue and how much of the cluster it affects.
Effect
In a SEV-1 instance, the Avocode service is not operational. New designs cannot be uploaded or processed and accessing existing designs is not possible.
Mitigation
We have monitoring in place to help us identify some of these types of issues. We monitor at the application layer (e.g. number of designs that failed to process) and at the infrastructure layer (e.g. number of 500 errors returned).
This monitoring alerts our on-call team if there is a critical problem.
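As a rough illustration of how the two monitoring layers can feed a single paging decision, the sketch below checks one application-layer metric and one infrastructure-layer metric against critical thresholds. The metric names, the threshold values and the functions `breached` and `should_page` are hypothetical and do not describe our actual monitoring configuration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str          # name of the monitored metric (hypothetical)
    layer: str           # "application" or "infrastructure"
    critical_above: int  # page on-call when the sample exceeds this value


# Hypothetical thresholds, one per monitoring layer.
THRESHOLDS: List[Threshold] = [
    Threshold("designs_failed_to_process", "application", 25),
    Threshold("http_500_responses", "infrastructure", 100),
]


def breached(samples: Dict[str, int]) -> List[Threshold]:
    """Return every threshold whose metric is strictly above its critical value."""
    return [t for t in THRESHOLDS if samples.get(t.metric, 0) > t.critical_above]


def should_page(samples: Dict[str, int]) -> bool:
    """Page the on-call team if any layer reports a critical breach."""
    return len(breached(samples)) > 0
```

In this sketch a breach in either layer is enough to page, so an application-level failure is not masked by healthy infrastructure metrics, and vice versa.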
Path to resolution
Once our on-call engineers are alerted, they investigate the issue and determine whether we can fix it ourselves or need to escalate it to Google Cloud Platform support.
Malicious software or other type of security breach
Description
As an internet-based software company, we are exposed to many different ways in which an attacker could affect our data integrity or operations.
Effect
It depends on the attack.
Mitigation
We have strong security protections in place, as described in Product Security.
Path to resolution
Again, it depends on the attack. The resolution time depends on the complexity of the attack and on whether we have the skills in-house to handle the resolution.
Database issues
Description
We use Postgres as our primary database for all of our services. We rely on Google's managed database service to handle maintenance and uptime monitoring.
Effect
If database operations were interrupted, the Avocode service would not be operational. New designs could not be uploaded or processed, and accessing existing designs would not be possible.
Mitigation
We have monitoring and alerting in place to help us identify if there are any problems in the database. This monitoring alerts our on-call team if there is a critical problem.
Path to resolution
Once our on-call engineers are alerted, they investigate the issue and determine whether we can fix it ourselves or need to escalate it to Google Cloud Platform support.
SEV-2 Threats
Disruption of service of Intercom
Description
We use Intercom to provide chat-based and email-based customer support for our users.
Effect
If Intercom were down, customers would have difficulty contacting Avocode support.
Mitigation
For the duration of the outage, we provide our support staff delegated access to the primary support email account so that they can communicate with customers via email. We can also provide a notice on avocode.com to point users to this email address instead of the in-app chat.
Path to resolution
We rely on Intercom to restore service as quickly as they can.
Disruption of normal communication channels
Description
At Avocode, we rely on Slack and email to communicate. We use these channels for normal business communication and also when responding to production incidents.
Effect
A service outage in one or both of these services during a production incident would slow down the identification and resolution of issues.
Mitigation
In addition to Slack and email (hosted by Google GSuite), every on-call team member has the phone number of every other member.
The communication fallback plan is this: Slack → in-person (if in the office) → email → phone
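The fallback order can be read as a simple ordered lookup: use the first channel in the list that is currently usable. The sketch below is only an illustration of that rule. The channel identifiers and the `pick_channel` function are hypothetical, and deciding which channels are usable is left to the caller.

```python
from typing import Iterable, Optional

# Order taken from the fallback plan: Slack, in-person, email, phone.
FALLBACK_ORDER = ["slack", "in_person", "email", "phone"]


def pick_channel(available: Iterable[str]) -> Optional[str]:
    """Return the highest-priority channel that is currently available.

    Returns None if no channel in the fallback plan is usable.
    """
    usable = set(available)
    for channel in FALLBACK_ORDER:
        if channel in usable:
            return channel
    return None
```

For example, if Slack is down and the team is not in the office, `pick_channel({"email", "phone"})` selects email before phone.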
Path to resolution
We believe that the established fallback plan has sufficient redundancy to allow our team to communicate in almost every scenario.
SEV-3 Threats
Disruption of access to physical Avocode offices
Description
Avocode offices could be inaccessible or unusable due to natural disasters, political unrest, utility failures or a number of other reasons.
Effect
If we are unable to use our office for an extended period of time, some business processes that currently rely on face-to-face interaction may take longer.
Mitigation
Our teams can work remotely with the same level of access to our production and administrative infrastructure as they would have in our offices. We authenticate users based on their device and identity, not based on their access to a particular network.
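To make the device-and-identity rule concrete, the sketch below shows an access decision that deliberately ignores the network a request comes from. The `AccessRequest` fields and the `is_allowed` function are hypothetical and simplified, and they are not a description of our actual authentication system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    user_authenticated: bool  # identity has been verified (hypothetical field)
    device_trusted: bool      # device is known and trusted (hypothetical field)
    source_network: str       # kept for audit purposes only, never used to decide


def is_allowed(request: AccessRequest) -> bool:
    """Grant access based on identity and device only, not on network location."""
    return request.user_authenticated and request.device_trusted
```

Because `source_network` plays no part in the decision, a request from a home network is treated exactly like one from the office.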
Path to resolution
Our office management team, together with third parties as necessary, works to restore access to the office as soon as possible.