WMSPanel failure status

This page keeps track of infrastructure failures of the WMSPanel cloud service.

We believe that being transparent about our problems and their solutions is the best way to make sure our customers are confident about the future of our products and services.

October 25, 2018 - Rackspace load balancer network issue

On October 25th, WMSPanel support detected packet loss from all our Rackspace servers to the ObjectRocket database service. Rackspace pulled in additional resources and went all hands on deck to resolve it as soon as possible.

They managed to isolate the issue at the load balancer layer of the DFW3 region and were actively working on remediation to restore full connectivity.

You can read the issue description on the ObjectRocket status website.

WMSPanel's application layer handled this infrastructure failure. There was a brief interruption in Web UI and API access, and some accounts were missing retrospective stats on the main dashboard during the malfunction. Daily stats, however, were not affected.

What has been done?

Once the balancers were back, WMSPanel users could access the panel and check stats.
However, we've made a few additional enhancements to make outages like this less impactful on the user experience.

October 19, 2018 - ObjectRocket database failure

The WMSPanel team uses one of the most reliable hosting providers: Rackspace. We rely on several of their services, such as Rackspace cloud hosting for the application layer and ObjectRocket for the MongoDB database used for processing and storing statistics and server settings. Currently, WMSPanel infrastructure relies on one of the DFW datacenters located in Dallas, Texas.

On October 19th, Rackspace planned to run a migration from the DFW1 to the DFW3 datacenter. The expected downtime was equivalent to an election/stepdown (15-60 seconds). Once the cutover was complete, ObjectRocket was supposed to let us know to update our MongoDB connection strings.

The problem occurred with the migration itself: the instances took longer than expected to finish the maintenance. The process used for the migration was not a dump/restore, but rather a data sync from DFW1 to DFW3. The ObjectRocket team programmatically moved over the config servers, then began a sync from the old instance to the new one.

Having connected to the new instances, our application servers couldn't find data about active servers in the respective DB collection. According to our business logic, when a server syncs up to WMSPanel and its ID cannot be found in our database, it is considered removed from the account control panel. In that case, WMSPanel responds with a command telling the server to stop sending sync-ups.

The database didn't have the servers list for about 10 minutes. Each server sends a sync-up every 30 seconds, which means that all active servers received the "stop syncing" response and switched themselves off from WMSPanel.
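The decision logic described above can be sketched roughly as follows. This is an illustrative reconstruction, not WMSPanel's actual code; the function and variable names are hypothetical.

```python
# Hypothetical sketch of the original sync-up decision logic.
# A server whose ID is missing from the database is treated as removed
# from the control panel and is told to stop sending sync-ups.

SYNC_INTERVAL_SECONDS = 30  # each server sends a sync-up every 30 seconds

def handle_sync_up(db, server_id):
    """Decide how to respond to a server's periodic sync-up."""
    server = db.get(server_id)
    if server is None:
        # ID not found: considered removed, so the server shuts itself off.
        return "STOP_SYNCING"
    return "OK"

# During the ~10 minutes the collection was empty, every active server hit
# the "not found" branch at least once and stopped syncing.
empty_db = {}
print(handle_sync_up(empty_db, "server-42"))
```

With an empty database, every sync-up within that window returns the stop command, which is exactly why all active servers dropped off the panel at once.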

By the time the problem was discovered, the database was already populated, so every server that tried to sync up with WMSPanel after that point could work with it successfully. This is why we asked all of our customers to restart their Nimble Streamer and Wowza server instances, so they would reappear in their accounts at the first sync.

What has been done to prevent this in the future?

Right after this major failure, we made fixes on our application layer: WMSPanel no longer stops a server's sync immediately when the server is not found in our database, but performs several additional checks before doing so. In addition, we set up automated alerts that notify our support team if an irregular number of servers is being shut off.
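The hardened behavior could look something like the sketch below. The retry count, alert threshold, and all names are illustrative assumptions, not the actual WMSPanel implementation.

```python
# Hypothetical sketch of the hardened sync-up logic: tolerate several
# consecutive "not found" results before stopping a server, and raise an
# alert if an irregular number of servers is shut off. All thresholds
# and names are assumptions for illustration.

MAX_MISSES = 3        # consecutive "not found" results tolerated per server
ALERT_THRESHOLD = 10  # shut-offs in one interval that trigger a support alert

miss_counts = {}            # server_id -> consecutive misses
shutoffs_this_interval = 0
alerts = []

def notify_support(count):
    # Stand-in for the automated alert that pages the support team.
    alerts.append(f"{count} servers shut off in this interval")

def handle_sync_up(db, server_id):
    global shutoffs_this_interval
    if db.get(server_id) is not None:
        miss_counts.pop(server_id, None)  # server is healthy, reset counter
        return "OK"
    # Not found: don't stop immediately; a transient DB inconsistency
    # (like the migration above) should not disconnect the fleet.
    miss_counts[server_id] = miss_counts.get(server_id, 0) + 1
    if miss_counts[server_id] < MAX_MISSES:
        return "RETRY_LATER"
    shutoffs_this_interval += 1
    if shutoffs_this_interval >= ALERT_THRESHOLD:
        notify_support(shutoffs_this_interval)
    return "STOP_SYNCING"
```

Under this scheme an empty database produces "retry later" responses for a while instead of an immediate shut-off, and a spike in shut-offs pages support rather than passing silently.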

These additional procedures give us more confidence in a good user experience even in case of database inconsistency.