Urgent help - DB Servers terminated and not restarted

Arie Fishler

18 Jul, 2016 02:04 PM

My database farm servers terminated unexpectedly - Both master and slave. Only one server restarted and it has a failed status.

So I am not sure what's going on and what should be done next.

It seems there is a problem attaching the db volume.

EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>VolumeInUse</Code><Message>vol-4a472303 is already attached to an instance</Message></Error></Errors><RequestID>1bcda86a-f482-4910-8223-53ae9785a6ac</RequestID></Response>
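
For anyone scripting around this error, here is a minimal Python sketch that pulls the error code and the volume ID out of the response body above. The helper name `parse_ec2_error` and the volume-ID regex are illustrative assumptions, not part of Scalr or any AWS SDK.

```python
import re
import xml.etree.ElementTree as ET

# Error body copied from the post above.
ERROR_BODY = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<Response><Errors><Error><Code>VolumeInUse</Code>"
    "<Message>vol-4a472303 is already attached to an instance</Message>"
    "</Error></Errors>"
    "<RequestID>1bcda86a-f482-4910-8223-53ae9785a6ac</RequestID></Response>"
)


def parse_ec2_error(body):
    """Return (code, message, volume_id); hypothetical helper, not an SDK call."""
    # Parse from bytes so the encoding declaration in the prolog is accepted.
    root = ET.fromstring(body.encode("utf-8"))
    error = root.find("./Errors/Error")
    if error is None:
        return None, None, None
    code = error.findtext("Code")
    message = error.findtext("Message") or ""
    # Assumption: volume IDs look like "vol-" followed by hex digits.
    match = re.search(r"vol-[0-9a-f]+", message)
    volume_id = match.group(0) if match else None
    return code, message, volume_id


code, message, volume_id = parse_ec2_error(ERROR_BODY)
print(code, volume_id)
```

A `VolumeInUse` code only says the volume is attached somewhere; it does not say to which instance, so that has to be looked up separately.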

Farm 923

Please advise.

FOLLOW UP:
In a way this looks much more serious and at the same time...not. Looking at the Amazon Console, it seems my servers did not terminate at all. It seems I now have twice as many servers running. It looks like the Scalr console decided stuff went wrong and started restarting everything as new servers. This is really bad, as the Scalr console does not accurately reflect the actual server state.

It seems all this mess is on your side, guys. My production system seems to be live, but I cannot see the real status on the Scalr console.

  1. 1 Posted by Nir Ben-Dor on 18 Jul, 2016 02:24 PM

    SAME HERE! PLEASE ADVISE, SCALR!

  2. Support Staff 2 Posted by Marat Komarov on 18 Jul, 2016 02:59 PM

    We're investigating this issue and will provide details later. Right now all our team is focused on service recovery.

    Regards,
    Marat

  3. Support Staff 3 Posted by Marat Komarov on 18 Jul, 2016 04:31 PM

    Scalr lost control over this server and launched a new one. The volume is still attached to the previous server, i-d7c6a7fc.

    Two possible recovery options:

    • You can terminate server i-d7c6a7fc, and the volume will become available for the new server.
    • We can terminate the replacement server and recover the records for the old one.

    If you haven’t touched the old server (the volume is still there, the EIP is still there, etc.), option #2 will be best. If you tried to do something manually, it’s better to terminate the old server and let Scalr initialize a new one.
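
Before choosing between the two options, it can help to confirm which instance actually holds the volume. A rough Python sketch follows, assuming a boto3-style EC2 client is passed in; `find_attachments` and `StubEC2Client` are hypothetical names, and the stub only stands in for a real client so the example runs without AWS credentials.

```python
def find_attachments(ec2_client, volume_id):
    """List (instance_id, device, state) for every attachment of a volume."""
    response = ec2_client.describe_volumes(VolumeIds=[volume_id])
    found = []
    for volume in response.get("Volumes", []):
        for attachment in volume.get("Attachments", []):
            found.append(
                (
                    attachment.get("InstanceId"),
                    attachment.get("Device"),
                    attachment.get("State"),
                )
            )
    return found


class StubEC2Client:
    """Hypothetical stand-in so the sketch runs without AWS credentials."""

    def describe_volumes(self, VolumeIds):
        # Placeholder response; the device name is invented for illustration.
        return {
            "Volumes": [
                {
                    "VolumeId": VolumeIds[0],
                    "Attachments": [
                        {
                            "InstanceId": "i-d7c6a7fc",
                            "Device": "/dev/sdx",
                            "State": "attached",
                        }
                    ],
                }
            ]
        }


print(find_attachments(StubEC2Client(), "vol-4a472303"))
```

With real credentials, the stub would be replaced by `boto3.client("ec2")` for the appropriate region. An empty result would suggest the volume is free for the new server.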

  4. 4 Posted by Arie Fishler on 18 Jul, 2016 04:35 PM

    I haven't touched anything. I would rather go back naturally.

    Thing is there are other servers, other farms. Not DB only. If possible would prefer to go back to the previous state and have all instances launched by mistake terminated.

    Thanks.

  5. 5 Posted by Arie Fishler on 18 Jul, 2016 06:43 PM

    When will you revert back to the previous state?

  6. 6 Posted by marc on 18 Jul, 2016 06:57 PM

    Hi Arie,

    Apologies for the delay. This action is presently in progress with Engineering. We will circle back as soon as this is confirmed complete. Thank you for your ongoing patience.

    Many thanks,
    Wm. Marc O'Brien
    Scalr Technical Support

  7. Support Staff 7 Posted by Michael Lochead on 18 Jul, 2016 11:35 PM

    Hi Arie,

    As communicated, Scalr experienced a critical error this morning as detailed here: http://support.scalr.net/discussions/problems/26464-hosted-scalr-se...

    At this point, we have completed automated recovery of all impacted Servers. If you are experiencing any remaining impact to your service, please re-open this ticket, or create a new one and we will put you in direct contact with our engineering team to resolve the outstanding issues.

    Thank you for your patience and collaboration in helping to address this issue. We will send a final report on the problem soon.

    Best,
    Michael

  8. Michael Lochead closed this discussion on 18 Jul, 2016 11:35 PM.

  9. Arie Fishler re-opened this discussion on 19 Jul, 2016 07:28 AM

  10. 8 Posted by Arie Fishler on 19 Jul, 2016 07:28 AM

    Hi There,

    You addressed only the one farm I mentioned initially. The following farms did not return to their previous state:
    1173
    2587
    3782
    5293

    Also, regarding the DB: I see in farm 923 that the master returned to its original server, but the slave is a new instance.

    Are you going to handle all this?

    ====================================
    OK - I have now gone through the issue report and will handle this via the orphaned servers page.

  11. Support Staff 9 Posted by Marat Komarov on 19 Jul, 2016 07:57 AM

    1173
    2587
    3782
    5293

    We can restore the original servers if they are still running in the cloud: https://my.scalr.net/#/discoverymanager/orphanedservers?cloudLocati...

    Please specify the farms / farmRoles that you want to restore.

  12. 10 Posted by Arie Fishler on 19 Jul, 2016 08:11 AM

    Farm 3782 -> Role: platform-tomcat7-ubuntu1404 (2 servers).
    On the orphaned servers page, the ones that need to be restored are i-bb4f920b and i-e0d10d50.

    This will do. I will handle all the others.

  13. Support Staff 11 Posted by Marat Komarov on 19 Jul, 2016 08:16 AM

    923

    We can restore the original slave. Please confirm.

    Also, I noticed that a different volume was manually attached instead of the Scalr master data volume:

    /dev/xvdh on /mnt/dbstorage type ext3 (rw)
    /dev/xvdj on /mnt/dbstorage type ext3 (rw)
    
    Scalr is tracking /dev/xvdh (vol-eba4d3a2). We're going to update our records to /dev/xvdj (vol-4a472303).

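
The two mount lines above show two devices mounted on the same mount point. As a rough illustration, a short Python sketch that flags this situation from `mount` output; the function name `stacked_mounts` is a hypothetical helper, and the parsing assumes the simple `<device> on <mount point> type <fstype> (<options>)` shape with no spaces in paths.

```python
from collections import defaultdict

# Mount lines copied from the post above.
MOUNT_OUTPUT = """\
/dev/xvdh on /mnt/dbstorage type ext3 (rw)
/dev/xvdj on /mnt/dbstorage type ext3 (rw)
"""


def stacked_mounts(mount_output):
    """Map mount points to their devices, keeping only points with 2+ devices."""
    by_point = defaultdict(list)
    for line in mount_output.splitlines():
        parts = line.split()
        # Expected shape: <device> on <mount point> type <fstype> (<options>)
        if len(parts) >= 5 and parts[1] == "on" and parts[3] == "type":
            by_point[parts[2]].append(parts[0])
    return {point: devs for point, devs in by_point.items() if len(devs) > 1}


print(stacked_mounts(MOUNT_OUTPUT))
```

When several devices are mounted on one path, the one mounted last is normally the one visible at that path, which would be consistent with the records being updated to /dev/xvdj.
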
  14. 12 Posted by Arie Fishler on 19 Jul, 2016 08:18 AM

    OK, if you can restore the original slave as well, that's cool. Please do.

    I don't recall any manual volume changing. Is this on the DB farm 923?

  15. 13 Posted by Arie Fishler on 19 Jul, 2016 08:23 AM

    Can you also restore farm 5293? One server there only.

  16. 14 Posted by Arie Fishler on 19 Jul, 2016 09:05 AM

    All the servers now on the orphaned servers page are waiting to be restored.
    I removed all that was not needed.

  17. Support Staff 15 Posted by Marat Komarov on 19 Jul, 2016 11:01 AM

    Arie,

    We've restored all orphaned servers. Please don't forget to terminate duplicates.

    JFYI: af927170-71cb-4648-a970-77a0738efbf4 is Ubuntu 8.04, and the Agent doesn't support it anymore.

  18. 16 Posted by Arie Fishler on 19 Jul, 2016 11:09 AM

    Thanks Marat......well done. Appreciate your response as always. Hope you guys can get some rest after this crisis. All the best, Arie.

  19. 17 Posted by Arie Fishler on 19 Jul, 2016 11:37 AM

    NOOOOOOOOOOOO.....Marat, my DB farm is going crazy again. It displays the wrong servers.

    Seems like the original ones are still live.

    Now Scalr keeps trying to restart 2 servers (it probably fails because the originals are live).

  20. 18 Posted by Arie Fishler on 19 Jul, 2016 11:50 AM

    Same goes for the other farms you just restored!!

  21. Support Staff 19 Posted by Igor Savchenko on 19 Jul, 2016 01:37 PM

    Hi Arie,

    We're looking into this.

    Regards,
    Igor

  22. Support Staff 20 Posted by Igor Savchenko on 19 Jul, 2016 01:49 PM

    Should be fixed now. Please confirm.

    Regards,
    Igor

  23. 21 Posted by Arie Fishler on 19 Jul, 2016 01:59 PM

    Seems to be fine. Thanks.

  24. marc closed this discussion on 19 Jul, 2016 04:03 PM.
