Hosted Scalr service interruption main discussion thread

marc

18 Jul, 2016 04:19 PM

Hello all,

We are presently investigating this incident and will reply with more information as it becomes available. Status updates will be posted in this discussion thread as well as on Twitter. Thank you for your ongoing patience while we work to resolve this issue.

Scalr Technical Support

  1. 1 Posted by marc on 18 Jul, 2016 04:32 PM

    Earlier this morning, an update with flawed logic was deployed to Hosted Scalr, causing server records to be removed while the actual servers (on AWS and other clouds) were left in place.

    By default, when Scalr loses communication with a running server (or when the server record is removed), the Scalr desired state engine takes action to launch a new server to replace the previous one, and then moves resources such as EBS volumes and EIP addresses from the old server to the new one. As a result, there are two likely scenarios:

    (1) In situations where servers were re-launched and restarted successfully, including reconfigured EBS volumes and instances added back to the ELB, no further action should be required, and services should be restored normally. We will follow up to check for any remaining issues; however, no specific action is required at this time.

    (2) In situations where servers were re-launched but not restarted successfully (e.g. servers still in a failed, initializing or pending state), and/or EBS, EIP, and ELB were not configured properly, Scalr is working on individually restoring these particular servers and will communicate any specific issues or information to the users directly. We are going through these on a case-by-case basis to make sure we handle each one as cautiously as possible, and therefore complete recovery may take the better part of the day.
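
The replacement behaviour described above can be illustrated with a minimal sketch. This is not Scalr's implementation; it is a toy model that assumes the engine keeps one record per farm slot and treats a missing record as a lost server. The names `CloudServer`, `reconcile`, `slots`, `records`, and the identifiers `i-old`, `vol-1`, `eip-1` are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class CloudServer:
    server_id: str
    # Identifiers of attached resources, e.g. EBS volumes or EIP addresses.
    resources: set = field(default_factory=set)


def reconcile(slots, records, cloud, new_id):
    """Run one pass of a toy desired-state loop (hypothetical model).

    slots:   dict of slot name -> set of resources the slot should hold
    records: dict of slot name -> server id (the engine's bookkeeping)
    cloud:   dict of server id -> CloudServer (what actually exists)
    new_id:  callable returning a fresh server id
    Returns the ids of orphans: servers that exist but are not tracked.
    """
    for slot, wanted in slots.items():
        if records.get(slot) in cloud:
            continue  # record present and server found: nothing to do
        # Record missing or server gone: launch a replacement.
        replacement = CloudServer(new_id())
        cloud[replacement.server_id] = replacement
        # Move the slot's resources off whichever server still holds them.
        for server in cloud.values():
            if server is not replacement:
                server.resources -= wanted
        replacement.resources |= wanted
        records[slot] = replacement.server_id
    tracked = set(records.values())
    return sorted(sid for sid in cloud if sid not in tracked)


# Hypothetical scenario: the record was removed but the server still exists.
ids = (f"i-new{n}" for n in count(1))
cloud = {"i-old": CloudServer("i-old", {"vol-1", "eip-1"})}
slots = {"web": {"vol-1", "eip-1"}}
records = {}
orphans = reconcile(slots, records, cloud, lambda: next(ids))
```

In this toy scenario the old server ends up untracked and stripped of its resources while a replacement holds them, which is one way to picture how removed records could lead to both duplicate and orphaned servers.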

    We will continue to provide regular status updates at support.scalr.net (in this thread) throughout the day, and will follow up, post-recovery, with a Root Cause / Corrective Action report.

    Thank you for your patience and assistance with this issue.

  2. 2 Posted by marc on 18 Jul, 2016 05:12 PM

    Hi all,

    Quick note. In the near term we request that users do not manually delete orphaned servers until migration to newly provisioned servers is completed or original servers are confirmed to be restored by Scalr Engineering. This is to prevent any accidental loss of data while we work towards fully resolving this incident.

    Scalr Technical Support

  3. 3 Posted by vito on 18 Jul, 2016 05:17 PM

    Thanks for keeping us posted; communication is key during a time like this.

  4. 4 Posted by Jon Pastor on 18 Jul, 2016 05:42 PM

    Hi, as part of your fix, did you stop autoscaling from working? I ask because we got our servers back up (the issue was that the farm was configured correctly but the Chef server didn't come back up).

    We got our Chef server up in a new farm. Now our other farms are not autoscaling, so we can't get back online, even though all they need to do now is talk to Chef.

  5. 5 Posted by marc on 18 Jul, 2016 05:43 PM

    Note: Scalr Engineering will NOT remove orphaned servers without customer confirmation of issue resolution.

  6. 6 Posted by marc on 18 Jul, 2016 05:46 PM

    Hi Jon,

    Autoscaling has been temporarily disabled as part of our ongoing efforts to resolve this incident. We expect to have autoscaling enabled once again within the hour.

    Many thanks,
    Wm. Marc O'Brien
    Scalr Technical Support

  7. 7 Posted by John Williams on 18 Jul, 2016 05:50 PM

    Marc, is there any timeframe for when these issues will be resolved?

  8. 8 Posted by marc on 18 Jul, 2016 05:56 PM

    Hi John,

    I do not have a solid ETA for complete resolution. We are, however, dedicating all resources towards resolving this ASAP. I will post further updates here shortly.

    Many thanks,
    Wm. Marc O'Brien
    Scalr Technical Support

  9. 9 Posted by marc on 18 Jul, 2016 05:57 PM

    Status update: We have recovered or restored approximately 80% of impacted servers and are working to complete issue resolution for the remaining impacted instances.

    Further updates will be posted here as they become available.

    Many thanks,
    Wm. Marc O'Brien
    Scalr Technical Support

  10. 10 Posted by marc on 18 Jul, 2016 07:47 PM

    Status Update: Auto-scaling has been re-enabled. We are working hard to resolve any remaining issues for customers who are still impacted.

    Many thanks,
    Wm. Marc O'Brien
    Scalr Technical Support

  11. 11 Posted by Vito Louis Sans... on 18 Jul, 2016 07:48 PM

    Marc,

    Any update on the massive amounts of orphaned servers? I have 300+

  12. 12 Posted by marc on 18 Jul, 2016 07:52 PM

    Hi Vito,

    We would like to confirm with you and with Engineering that there are no ongoing issues in your account before proceeding with termination of orphaned servers. Could you submit a new Support Ticket for this so that we can track and address your inquiry individually? I will pick up the ticket and reply once you submit it.

    Many thanks,
    Wm. Marc O'Brien
    Scalr Technical Support

  13. 13 Posted by Jon Pastor on 18 Jul, 2016 08:01 PM

    You guys are doing great! On my account, FYI the only farms not green are ones where the server had termination protection enabled. All I need you to do is point back to those old servers and have it stop trying to make new ones. Those old ones are still in the UI with a "can't terminate due to protection" error.

    Would also need the automajic DNS entries to point back to the old ones rather than "deleted" like they are now

  14. 14 Posted by marc on 18 Jul, 2016 08:05 PM

    Hi Jon,

    Could you submit a quick ticket for this as well? We just want to be sure that we track and address your needs individually.

    Many thanks,
    Wm. Marc O'Brien
    Scalr Technical Support

  15. 15 Posted by Eugene Brodsky on 18 Jul, 2016 08:08 PM

    Hi Marc - what is the process for re-attaching a Percona master to the farm? I am currently getting a `403 Client Error:` when trying to start scalarizr.

  16. 16 Posted by marc on 18 Jul, 2016 08:10 PM

    Status Update: We understand the issue that orphaned servers pose, but we strongly recommend that all users do NOT terminate orphans until our Engineering team has had a chance to assess and advise each customer individually.

    If you have not already done so, please feel free to submit a ticket so that we may address each account individually.

    Many thanks,
    Wm. Marc O'Brien
    Scalr Technical Support

  17. 17 Posted by marc on 18 Jul, 2016 08:10 PM

    Eugene,

    Replying to you in our external email thread.

    Many thanks,
    Wm. Marc O'Brien
    Scalr Technical Support

  18. 18 Posted by marc on 18 Jul, 2016 10:16 PM

    Final Status Update:

    We have completed automated recovery. All impacted customers should no longer be experiencing issues. If issues persist, please reach out to us in a new or existing Support ticket so that we may host a call and resolve any lingering issues.

    Many thanks,
    Wm. Marc O'Brien
    Scalr Technical Support

  19. 19 Posted by marc on 19 Jul, 2016 04:08 PM

    A new, related issue has disrupted a small number of additional servers. If you are impacted, please accept our apologies while we work to get you back online.

    Scalr Technical Support

  20. 20 Posted by marc on 19 Jul, 2016 04:17 PM

    Hosted Scalr Issue: an additional update

    Last night we experienced a setback as we worked to resolve the final remaining issues associated with the incident that we experienced on Monday morning, as documented here.

    During the issue resolution process, a database action was taken that had the unintended consequence of terminating the records of some additional active servers and subsequently creating duplicate servers. We believe this oversight was caused by fatigue, and we will have a definitive answer following the post-mortem.

    Our Engineering team has been working around the clock (36 hours awake) to resolve the remaining issues and will continue to do so until all remaining issues have been addressed. We have organized into shifts to avoid additional mistakes caused by tiredness.

    We sincerely apologize for the additional impact, and will continue to provide updates as we resolve these issues.

    Scalr Technical Support

  21. 21 Posted by marc on 19 Jul, 2016 04:33 PM

    Quick Note: If you are affected by this second and new incident, please refrain from manual fixes. We are deploying a general fix and cannot guarantee compatibility with manual changes.

    Scalr Technical Support

  22. marc closed this discussion on 19 Jul, 2016 11:42 PM.
