
AWS Downtime Incident Report 25th August

26 August 2020

AWS experienced some downtime on the 25th August 2020 that had a significant impact on Alertatron. This report explains what happened, the steps we are taking to reduce the impact of a similar problem in the future, and some options for our customers.

What Happened?

AWS (Amazon Web Services) host all of Alertatron's services, and on the 25th August 2020 they experienced a loss of power and network connectivity in one of the data centers that hosts some of Alertatron's infrastructure.

This had an immediate impact on Alertatron, and in the hours that followed, while AWS restored power and other services, Alertatron suffered significant disruptions to our service that directly affected some trading bots and our ability to process incoming alerts.

Below is a breakdown of what happened, the impact it had on Alertatron customers, and the actions we will be taking to prevent a repeat of the same problem where that is possible.

The outage, as described by Amazon:

The latest update on the outage is as follows:
Starting at 2:05 AM PDT (9:05 AM UTC) we experienced power and network connectivity issues for some instances, and degraded performance for some EBS volumes in the affected Availability Zone.
As a result of this we also experienced elevated API latencies for some of our APIs, elevated API failure rates on the CreateSnapshot API, and elevated instance launch failure rates in the affected Availability Zone.
By 4:50 AM PDT (11:50 AM UTC), power and networking connectivity had been restored to the majority of affected instances, and degraded performance for the majority of affected EBS volumes had been resolved.
While we will continue to work to recover the remaining instances and volumes, for immediate recovery, we recommend replacing any remaining affected instances or volumes if possible.

A little background on our infrastructure: Alertatron is distributed across 3 data centers (availability zones in AWS speak) in Europe. Our systems are designed to be resilient when failures occur and should generally be able to cope with 2/3 of our infrastructure going offline at once and still keep going. We operate with a lot of redundancy and a lot of spare capacity.

This incident is the first unplanned downtime in Alertatron's history. We'd like to thank all our customers for their patience and support while things were on fire. 🙏

Incident Timeline...

09:05 UTC, 25th August 2020

Power fails in one of the 3 data centers our servers are hosted in.

About half of the servers in the affected data center instantly went offline due to the power loss. Servers in the other two data centers were not affected by this.

Our monitoring and alerting systems notified us immediately of the problem and we investigated. It quickly became apparent that some of our servers were not responding, so we marked all the bots on those servers as unavailable so that new requests would not be allocated to a bot that was not working. This largely resolved the situation for customers that had not lost their active bot in the power outage.
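
To illustrate the idea (this is a simplified, hypothetical sketch rather than our actual code), routing only considers bots whose server is still marked as available:

```typescript
// Hypothetical sketch only - names and structures are illustrative,
// not Alertatron's actual code.
interface Bot {
  id: string;
  host: string;
  available: boolean;
}

// Mark every bot hosted on a failed server as unavailable.
function markHostsUnavailable(bots: Bot[], failedHosts: Set<string>): void {
  for (const bot of bots) {
    if (failedHosts.has(bot.host)) {
      bot.available = false;
    }
  }
}

// New requests are only ever allocated to bots still marked as available.
function pickAvailableBot(bots: Bot[]): Bot | undefined {
  return bots.find((bot) => bot.available);
}
```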

11:50 UTC

Amazon reports that power has been restored and most systems are back

We still had some servers that had not come back (Amazon later confirmed these could not be recovered and would need to be replaced, which we did). We were also starting to notice some problems with alert messages being delayed and bots not behaving correctly as a result.

For the next 90 minutes we tried to find the cause of the problem that was blocking and delaying alerts. Alerts would arrive normally and appear in your Inbox as they should, but were not processed beyond this, so they did not get forwarded to Telegram, or executed on the bots.

This processing of alerts (and other tasks) is handled by an army of 'workers' that process jobs (such as 'send this alert to Telegram') from a queue of pending tasks. This allows us to distribute the work and handle sudden peaks in activity without any problems, but today the jobs were piling up in the queue and the workers were not picking any of them up.
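
In outline, the pattern looks something like the sketch below - a deliberately simplified, in-memory version with hypothetical names; in production the queue is a separate service and the workers run as many independent processes:

```typescript
// Simplified, in-memory sketch of the queue/worker pattern described above.
type Job = { id: string; type: string; payload: unknown };

const queue: Job[] = [];

// Producers push jobs such as "send this alert to Telegram".
function enqueue(job: Job): void {
  queue.push(job);
}

// Each worker repeatedly asks the queue for the next pending job.
async function worker(handle: (job: Job) => Promise<void>): Promise<void> {
  for (;;) {
    const job = queue.shift();
    if (!job) {
      // Nothing to do right now - wait briefly and ask again.
      await new Promise((resolve) => setTimeout(resolve, 100));
      continue;
    }
    // If this call never resolves, the worker never asks for another job.
    await handle(job);
  }
}
```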

14:15 UTC

Cleared the Queue

It was clear people did not want trades from alerts sent 90 minutes ago to suddenly start executing, so we took the decision to clear the queue. This resolved the jam: the workers started to process new incoming alerts again, and it appeared that the issue was fully resolved.

Everything operated normally again until around 18:23 UTC

We adjusted the queue configuration to prevent very long overdue tasks from being processed.
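
Conceptually the adjustment works like the sketch below (hypothetical code; the threshold shown is illustrative - the value we eventually settled on is covered in the actions list further down):

```typescript
// Sketch: discard jobs that have been waiting too long instead of
// executing stale trades. The threshold is illustrative only.
const MAX_JOB_AGE_MS = 10 * 60 * 1000;

interface QueuedJob {
  id: string;
  enqueuedAt: number; // epoch milliseconds
}

// Stale jobs are diverted to a 'failed' list rather than processed.
function takeNextJob(
  queue: QueuedJob[],
  failed: QueuedJob[],
  now: number = Date.now()
): QueuedJob | undefined {
  while (queue.length > 0) {
    const job = queue.shift()!;
    if (now - job.enqueuedAt > MAX_JOB_AGE_MS) {
      failed.push(job);
      continue;
    }
    return job;
  }
  return undefined;
}
```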

18:23 UTC

Jobs stopped being processed again

All the queue workers had stopped processing again and incoming alerts were no longer being sent to bots or forwarded to other destinations.

We continued to try and understand what was blocking them.

20:12 UTC

Source of the blocked messages identified.

Some configuration had been corrupted by the earlier power failure, triggering a chain reaction that eventually left all the workers stuck waiting for a response from one of the servers lost in the power failure. It happened like this (a minimal sketch of the failure follows the list)...

  • Due to some broken configuration, a rare set of events would trigger a job to query the status of one of the bots on one of the servers that was no longer present (lost in the power failure).
  • As the server was no longer there, it never responded and the worker waited patiently forever for it to respond. This queue worker was now locked up and would never ask for new jobs to process.
  • After a few minutes the queue service noticed that the job had not been completed, assumed something had gone wrong, and sent it to the next worker that asked for something to do.
  • The second worker now suffered the same fate as the first one, waiting forever for the missing server to respond.
  • Gradually over time, all the workers got tied up in this fashion. Restarting them typically just had them pick up the same bad job fairly soon and get locked again.
  • Once all the workers were locked up, no more jobs could be processed and they just backed up.
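
The sketch below (hypothetical names, not our real code) captures the essence of the problem: a status query against a server that will never answer, with no timeout, so the await never resolves and the worker never asks the queue for another job.

```typescript
// Sketch of the failure mode. The status query against the lost server
// never resolves, so the awaiting worker is stuck for good.
function queryBotStatus(serverUrl: string): Promise<string> {
  // Stand-in for a network call to a host that no longer exists: in the
  // incident the request simply hung, which a never-resolving promise
  // models here.
  return new Promise<string>(() => {});
}

async function handleStatusJob(job: { botServerUrl: string }): Promise<void> {
  const status = await queryBotStatus(job.botServerUrl); // blocks forever
  console.log(status); // never reached - the worker is now locked up
}
```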

Once we had identified the problem, we were able to fix the configuration and restart the workers so they would no longer get blocked, and all systems returned to normal operation.

Actions to prevent this happening again...

Here is what we are doing to ensure this won't happen again.

  • 🟢 [DONE] Update the queue service so it no longer keeps unprocessed jobs forever - holding on to them indefinitely never made sense for Alertatron. Jobs that have not been handled within 10 minutes are now moved to a 'failed' queue.
  • 🟢 [DONE] Workers will no longer wait forever for a job to complete. They will now wait a reasonable amount of time and, if the job has not finished within the time limit, mark it as failed.
  • 🟢 [DONE] Failed jobs will not be re-tried many times. If they fail twice they will be moved to the failed list and not retried again. This, combined with the previous change, ensures that no job, however bad, can take out all the workers - at worst it will tie up 2 workers for a short period of time (2 is only a small fraction of the number running at any time). A sketch of the new worker behaviour follows this list.
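
Taken together, the changes amount to something like the sketch below - hypothetical names and illustrative values, with a per-job time limit on the worker side and a cap of two attempts before a job is parked on the failed list:

```typescript
// Sketch of the new worker behaviour (illustrative values and names).
const JOB_TIMEOUT_MS = 60 * 1000; // a worker will not wait longer than this
const MAX_ATTEMPTS = 2;           // after two failures the job is parked

class JobTimeoutError extends Error {}

// Race the job against a timer so a hung job can no longer lock a worker.
async function runWithTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new JobTimeoutError('job timed out')), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

interface TrackedJob {
  id: string;
  attempts: number;
  run: () => Promise<void>;
}

async function processJob(
  job: TrackedJob,
  requeue: (job: TrackedJob) => void,
  failed: TrackedJob[]
): Promise<void> {
  try {
    await runWithTimeout(job.run(), JOB_TIMEOUT_MS);
  } catch {
    job.attempts += 1;
    if (job.attempts >= MAX_ATTEMPTS) {
      failed.push(job); // parked on the failed list, never retried again
    } else {
      requeue(job); // one more attempt, then it is parked
    }
  }
}
```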

Other Steps

  • We are unable to prevent power failures at the data center. As this is a very rare event, we have no plans to migrate away from AWS at this point.
  • With the improvements listed above, this incident would only have affected customers with a bot on a server that lost power (about 20% of bots on this occasion), would have been over in about 15-20 minutes (even while the power issues continued at AWS), and would have had no impact on the majority of customers. While it would still have been a problem for the customers affected, there would not have been any extended issues and it would have been easier for them to recover and repair.

Customer Suggestions

We received a number of suggestions from customers during the incident that were very welcome and we'll be working on these improvements...

  • Have a separate 'notifications' support channel, so we can broadcast details of incidents without the information getting lost in the chat stream.
  • Email notifications about incidents if they should occur. We plan to add this as an option in the notifications settings, so you can choose to get an email to let you know if something is wrong.
  • Add the ability to pause API keys and the trading bot as a whole in your account. This would have removed a lot of stress for customers that were worried old alerts were suddenly going to turn up and start running trades at the wrong time.
  • A more detailed status page that covers more parts of our infrastructure (not just the web site). Our existing status page just provides information about uptime for the web site, and does not give details about the status of the trading bots.
