Servers terminated unexpectedly

Stefan Muscat Doublesin

18 Jul, 2016 02:02 PM

Today at about 12:57, our Scalr farm started acting up. It began with the errors below, after which all servers were terminated. This brought the live site down, we lost one machine completely (the UAT server was not replaced), and the load balancer (d2ffc02c-0f99-4938-9c37-78631238b15b) is stuck in "Initializing" status. Can you let us know exactly what happened and why the load balancer is not starting up?


Jul 18, 2016 12:57:29

1

Caller: 37f63136-23df-400f-b4a8-303cec8b905c/FarmLog

Message: Server 37f63136-23df-400f-b4a8-303cec8b905c (ec2) found in Scalr but not found in the cloud. Terminating.


Jul 18, 2016 12:57:07

2

Caller: 54c8d1b5-a07e-42c3-965f-883d93232e50/Scalr_Scaling_Sensors_LoadAverage

Message: Unable to read LoadAverage value from server #1: 54.229.175.61: Wrong crypto key!

  1. 1 Posted by Stefan Muscat D... on 18 Jul, 2016 02:21 PM

    I have just seen that the issue is affecting many customers, and that the servers actually still exist on AWS.

    Would it be possible to reinstate the previous servers into Scalr again please?

  2. Support Staff 2 Posted by Marat Komarov on 18 Jul, 2016 02:59 PM

    We're investigating this issue and will provide details later. Right now all our team is focused on service recovery.

    Regards,
    Marat

  3. 3 Posted by Stefan Muscat D... on 18 Jul, 2016 04:39 PM

    We managed to reinstate the production site by bypassing Scalr's configuration and processes and manually setting up the Elastic IPs, the cron jobs, and DNS resolution.

    Please make sure that, once Scalr is working again, it does not interfere with our running servers.

    Thanks.

  4. Support Staff 4 Posted by Marat Komarov on 18 Jul, 2016 07:20 PM

    We've recovered your farm state in Scalr. Please confirm that you have no more issues with configuration and operations.

    Regards,
    Marat

  5. Support Staff 5 Posted by Michael Lochead on 18 Jul, 2016 11:36 PM

    Hi,

    As communicated, Scalr experienced a critical error this morning as detailed here: http://support.scalr.net/discussions/problems/26464-hosted-scalr-se...

    At this point, we have completed automated recovery of all impacted Servers. If you are experiencing any remaining impact to your service, please re-open this ticket, or create a new one and we will put you in direct contact with our engineering team to resolve the outstanding issues.

    Thank you for your patience and collaboration in helping to address this issue.  We will send a final report on the problem soon.

    Best,
    Michael

  6. Michael Lochead closed this discussion on 18 Jul, 2016 11:36 PM.

  7. Stefan Muscat Doublesin re-opened this discussion on 18 Jul, 2016 11:43 PM

  8. 6 Posted by Stefan Muscat D... on 18 Jul, 2016 11:43 PM

    I advised you to not touch our servers!! Now our site is back down and we may have lost data.

  9. Support Staff 7 Posted by Michael Lochead on 18 Jul, 2016 11:54 PM

    Stefan,

    If you are available now, can I connect you to our Engineering team via Skype to address?

    Skype contact is igorwebta

    Thanks,
    Michael

  10. 8 Posted by Stefan Muscat D... on 19 Jul, 2016 08:18 AM

    We are trying to reinstate our farm, but have issues with the load balancer

    Server ID - 01236f11-672c-498d-80d7-b75ac8e204cf
    Status - Failed
    [08:00:06] Create Server record in Scalr
    [08:01:22] Provision Server in Cloud Platform
    [08:00:47] Wait for OS to finish booting
    [08:01:28] Wait for Scalarizr Agent to update and start
    [08:01:56] Wait for Agent HostInit phase to complete
    Wait for Agent BeforeHostUp phase to complete
    /usr/sbin/nginx (code: 1) <out>: <err>: nginx: [emerg] BIO_new_file("/etc/scalr/private.d/keys/dhparam.pem") failed (SSL: error:02001002:system library:fopen:No such file or directory:fopen('/etc/scalr/private.d/keys/dhparam.pem','r') error:2006D080:BIO routines:BIO_new_file:no such file)
    nginx: configuration file /etc/nginx/nginx.conf test failed <args>: /usr/sbin/nginx -t
    Wait for Agent HostUp phase to complete
    Done

    I attached the debug log. Any details please? Please treat this as a matter of urgency as the site has been down since you "recovered" our farm last night.

    If for some reason you cannot fix the issue, please let me know in a timely manner, as I took an image directly from AWS and can possibly bring up the site bypassing Scalr.

  11. 9 Posted by Stefan Muscat D... on 19 Jul, 2016 08:19 AM

    I don't know if the attachment worked, so I am uploading it again.

  12. 10 Posted by Stefan Muscat D... on 19 Jul, 2016 08:33 AM

    I know the issue seems to be with missing SSL certificate files.

    Please communicate with us explaining the issue and situation BEFORE doing any actions, as in the meantime we have reinstated the site bypassing Scalr services, using a backup of the server I took yesterday, assigning the elastic IP, changing the DNS and manually setting nginx configurations.

    TLDR: communicate with us BEFORE performing any actions

  13. Support Staff 11 Posted by Marat Komarov on 19 Jul, 2016 08:47 AM

    The server is starting with a very old agent version. Please log in to your server and update the agent using this snippet:

    PLATFORM=ec2 && curl -L https://my.scalr.net/public/linux/latest/$PLATFORM/install_scalarizr.sh | sudo bash
    

    After that, let us know and we will trigger initialization once again.

  14. 12 Posted by Stefan Muscat D... on 19 Jul, 2016 09:31 AM

    I have updated the agent.

    Can you trigger initialization please?

  15. Support Staff 13 Posted by Marat Komarov on 19 Jul, 2016 09:42 AM

    The proxy www.inkinddirect.org has the following line in its server template:

    ssl_dhparam   /etc/scalr/private.d/keys/dhparam.pem;
    

    This certificate is not managed by Scalr. You should either remove this line and spawn a new server, or add the certificate with a HostInit script.

    Regards,
    Marat
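
The HostInit option mentioned above could look roughly like the sketch below. It is not Scalr tooling: the function name `ensure_dhparam` and the idea of a saved backup copy are illustrative assumptions, and only the target path comes from the template line quoted in this post. It restores the file from a backup when one is available and otherwise generates fresh parameters with `openssl dhparam`.

```shell
#!/bin/sh
# Sketch only: make sure a DH parameters file exists before nginx is tested.
# ensure_dhparam and the backup argument are hypothetical, not Scalr APIs.

ensure_dhparam() {
    target="$1"   # e.g. /etc/scalr/private.d/keys/dhparam.pem
    backup="$2"   # optional path to a saved copy (assumed to exist elsewhere)

    # Nothing to do if a non-empty file is already in place.
    if [ -s "$target" ]; then
        return 0
    fi

    mkdir -p "$(dirname "$target")" || return 1

    if [ -n "$backup" ] && [ -s "$backup" ]; then
        # Prefer the saved copy so the server keeps its previous parameters.
        cp "$backup" "$target" || return 1
    else
        # Otherwise generate new 2048-bit parameters (this can take a while).
        openssl dhparam -out "$target" 2048 || return 1
    fi

    chmod 600 "$target"
}
```

A HostInit script might call `ensure_dhparam /etc/scalr/private.d/keys/dhparam.pem <backup path>` and then run `/usr/sbin/nginx -t`, which is the check that failed in the server log earlier in this thread.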

  16. Support Staff 14 Posted by Marat Komarov on 19 Jul, 2016 09:48 AM

    If the server still won't initialize after that, please update the agent again and bump this thread so that we can re-trigger initialization.

    Regards,
    Marat

  17. 15 Posted by Stefan Muscat D... on 19 Jul, 2016 09:52 AM

    Understood, it seems this file was added at a later time and is not in the server image.

    I am uploading it from my backup and will let you know to re-initialise.

    Actually could I launch re-initialisation myself?

  18. 16 Posted by Stefan Muscat D... on 19 Jul, 2016 09:54 AM

    Uploaded. Please re-initialise.

  19. Support Staff 17 Posted by Marat Komarov on 19 Jul, 2016 10:00 AM

    Please start the agent with: service scalarizr start

  20. Support Staff 18 Posted by Marat Komarov on 19 Jul, 2016 10:16 AM

    We can't trigger re-initialization for this server. Please install the certificate with a blocking HostInit script. This should help.

    Regards,
    Marat

  21. 19 Posted by Stefan Muscat D... on 19 Jul, 2016 11:09 AM

    In the debug log, it seems to be stuck on this:

    2016-07-19 12:06:52,395+01:00 - INFO - scalarizr.handlers.lifecycle - Normal start
    2016-07-19 12:06:52,396+01:00 - DEBUG - scalarizr.handlers.nginx - Handling on_start message
    2016-07-19 12:06:52,396+01:00 - DEBUG - scalarizr.handlers.hooks - Hook on 'start'() {}

    Are there any other logs I can check?

  22. Support Staff 20 Posted by Igor Savchenko on 19 Jul, 2016 11:21 PM

    Fixed. The server was successfully initialized. Please don't forget to take a snapshot, so that if you replace this server the new one reaches running status without any issues.

    Regards,
    Igor
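
Before taking that snapshot, it may help to confirm that every SSL-related file the nginx configuration points at is actually on disk, since a missing one was what blocked initialization earlier in this thread. The helper below is a sketch under stated assumptions, not a Scalr or nginx tool: `missing_ssl_files` is a hypothetical name, it reads only the single config file it is given (it does not follow `include` directives), and it assumes paths are absolute and contain no spaces.

```shell
#!/bin/sh
# Sketch only: list files referenced by common nginx ssl_* directives
# that do not exist. missing_ssl_files is an illustrative name.

missing_ssl_files() {
    conf="$1"
    missing=0

    # Pick the path argument of lines such as "ssl_dhparam /path/file;".
    # Commented-out lines are skipped because their first field starts with "#".
    for f in $(awk '
        $1 ~ /^ssl_(certificate|certificate_key|dhparam|trusted_certificate|client_certificate)$/ {
            sub(/;$/, "", $2)
            print $2
        }' "$conf"); do
        if [ ! -e "$f" ]; then
            printf '%s\n' "$f"
            missing=1
        fi
    done

    # Non-zero exit status signals that at least one file is absent.
    return "$missing"
}
```

For example, `missing_ssl_files /etc/nginx/nginx.conf` prints each referenced file that is absent and returns a non-zero status if it finds any.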

  23. 21 Posted by marc on 19 Jul, 2016 11:51 PM

    Hello Stefan,

    We are closing this ticket as resolved as per the previous comments. If any issues persist or if you have any questions, please reopen this ticket by replying or open a new ticket and we can resume troubleshooting efforts if necessary. Thank you for your patience while we worked through these issues.

    Many thanks,
    Wm. Marc O'Brien
    Scalr Technical Support

  24. marc closed this discussion on 19 Jul, 2016 11:51 PM.

  25. Stefan Muscat Doublesin re-opened this discussion on 20 Jul, 2016 08:15 AM

  26. 22 Posted by Stefan Muscat D... on 20 Jul, 2016 08:15 AM

    Could you tell me what the exact issue with the server was, please?

  27. Support Staff 23 Posted by Igor Savchenko on 20 Jul, 2016 05:18 PM

    The issue was related to the way scalarizr was updated. The manual update corrupted the configuration that is used for communication encryption. I've updated the encryption keys and restarted initialization.

  28. Igor Savchenko closed this discussion on 20 Jul, 2016 05:18 PM.
