CEO John Baker: You Have Every Right to Be Angry
COO Dennis Kavelman: Information About Recent Service Disruption
Added 2/4/2013: Some Responses to the Responses
Note: comments on this post are moderated. Comments containing profanity or personal attacks will not be posted.
Update: comments have now been closed. Pretty sure we’ve covered all the ground at this point. Facebook and Twitter remain open.
Original post starts here. From Wed. Jan. 30, 2013
I arrived in Kitchener at about 10 PM on Tuesday, January 29. I had a lousy day of travel and arrived about 7 hours later than planned. During those 7 hours I missed much of the news about the performance problems that were impacting many Desire2Learn clients. Let’s just say that many of the D2L-hosted clients were having some pretty big problems.
As I walked into the office this morning at 8 AM, John Baker pulled me into a meeting with several others to get the lay of the land, which included a briefing on the ongoing plans to fix the issues and on the communications being written to inform clients of the problems and the progress. A critical piece of hardware is not performing as promised, and this has created additional issues that require attention from the hardware vendors as well as the D2L team.
CEO Baker and several other people were working on 0-3 hours of sleep. Phone calls to vendors at 2:00 AM are not good ways to start/continue/finish your day. Never for a second should you think that they don’t take this stuff seriously. As this professor says, people depend on our service and expect high levels of uptime and performance. And they should.
Communications were going out to the service contacts at each impacted client every couple of hours. Still, until the solutions are fully implemented, it’s impossible to send the message that customers really want to hear (It’s fixed!). Here is an example of one of those updates:
10:30am ET Update:
At this time we would like to provide a bit more visibility into what we have been experiencing over the past 24 hours.
As part of a major investment in a next-generation infrastructure project within our SaaS facilities, which we began in the fall of 2012, many changes to our environment were required as part of our migration into this new environment.
One of these changes required a sophisticated process of migrating data to our new enterprise storage solution. We made a decision that it would serve our clients best to migrate this data over time, with the help of our vendors, using technologies purpose-built for live migration. This methodology removed the need for long, multi-day maintenance windows due to the large volumes of file data that needed to be transferred. Effectively this “file virtualization” technology (“ARX”) would allow the seamless use of both source and destination storage during the migration with no impact to users.
The issues currently being experienced have been determined to exist within the ARX technology. We are currently seeing different impacts on different customers. For customers whose data migration had not yet begun, or for customers whose migration had already completed, we were able to remove the ARX solution from their environment, resulting in a complete restoration of service.
For customers who are in the midst of their data migration, the ARX cannot be removed, and we have initiated a separate restoration process that applies to a portion of customers. This process involves a configuration change to the internal format of metadata within the ARX. This change has been shown to have a positive impact on the clients for whom the process has completed. However, this configuration change takes time to process, and we are targeting noon EST today for completion, with clients seeing ongoing improvements as it makes progress.
There is a small group of customers who are affected by one additional issue, caused by a recommended course of action that did not produce the expected results. We are currently in the midst of backing this change out, and a new course of action is being considered. We will contact these affected customers directly.
We will be clarifying potential resolution times specific to each client, in our next update at noon.
That may be more information than some clients want or need, but we’d rather share too much information than too little. Over the entire day that I’ve been sitting in on the conversations and strategy sessions, it has become obvious that a large number of people are actively and directly working on a solution. Almost all of the executive team is deeply involved in the action – and it strikes me as more than a little bit amazing that these people all have the technical chops to understand and deal with the problem. Still, a great deal of the detailed work is being done by the Support/IT staff, since they are the ones migrating databases and metadatabases, and they are the ones bypassing this or reconfiguring that. The account managers are working hard to provide a communications link to the impacted clients.
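To make the “file virtualization” idea in the update above more concrete: a routing layer sits in front of both storage tiers and directs each file request to whichever tier currently holds the authoritative copy, so files can be moved in the background with no maintenance window. The sketch below is purely illustrative – the class and method names are hypothetical and this is not D2L’s or F5’s actual ARX implementation.

```python
# Hypothetical sketch of live migration via a file-virtualization layer.
# Source and destination storage are modeled as simple dicts; a real
# system would front actual filers, but the routing idea is the same.

class FileVirtualizer:
    def __init__(self, source, destination):
        self.source = source            # old storage: path -> bytes
        self.destination = destination  # new storage: path -> bytes
        self.migrated = set()           # paths already moved over

    def read(self, path):
        # Route to whichever tier currently owns the file.
        if path in self.migrated:
            return self.destination[path]
        return self.source[path]

    def write(self, path, data):
        # Writes also follow ownership, so there is one source of truth.
        if path in self.migrated:
            self.destination[path] = data
        else:
            self.source[path] = data

    def migrate_one(self, path):
        # Move a single file; reads and writes keep working throughout,
        # which is why no multi-day maintenance window is needed.
        self.destination[path] = self.source.pop(path)
        self.migrated.add(path)


# Usage: clients keep reading the same path before and after migration.
src = {"a.txt": b"alpha", "b.txt": b"beta"}
vfs = FileVirtualizer(src, {})
assert vfs.read("a.txt") == b"alpha"   # served from source
vfs.migrate_one("a.txt")
assert vfs.read("a.txt") == b"alpha"   # now served from destination
```

The trade-off the update describes follows from this design: the virtualization layer is in the request path for every file, so a fault in that layer affects exactly the customers whose migrations are in flight, while customers fully on either side of the migration can be detached from it.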
Everybody is working toward a solution. Nobody takes this lightly. It’s all hands on deck.
NOTE: The section below was updated periodically throughout the outage. It was posted at the top of this page. Moved to bottom of page on 2/5/13.
Last update 6:30 PM, 2/1/13
All organizations have been fully verified and brought back up to operational status. Support staff will continue to monitor system performance, which is currently running smoothly under a normal load. We will continue to fine tune during normal maintenance periods.
3:30 PM, 2/1/13 – The last couple of client sites are going through final verification at this time. All other sites have been brought up.
2:00 PM, 2/1/13 – Many more client installations are back up and running. For the remaining clients that are still down, much of the needed work has been completed. A rigorous verification and testing process must be completed before the sites are brought back up. We have been in contact with the designated person for each account. They will be notified again as soon as their site is ready to be brought back online. Next update expected in 2-3 hours.
8:35 AM, 2/1/13 – Approximately half of the impacted organizations have been restored to full service. Good progress is being made on the other half of the sites. No definite time can be given for completion for any one site, but we will try to keep you posted. Besides returning all organizations to full functionality, another major goal has been to ensure no loss of data. Guaranteeing that no data is lost makes the process slower than a riskier approach would be. BD
4:35 AM, 2/1/13 – The recovery process for the impacted organizations is a case of site-by-site data migration off of the failed system. This process continues, since not all sites have been migrated at this time. Each account’s designated contact is notified when their site’s migration is complete and the site is back up, although you will probably know before that, when your site becomes accessible once again. No completion time estimates can be given for any individual organization. Will update again if additional information becomes available. BD (italics added at 8:35 for clarification)
Quote below from CEO John Baker, 1/31/13
As many of you are painfully aware, Desire2Learn is experiencing significant challenges in one of its cloud data centers that is dramatically impacting some students’ online experience. This stems from the file virtualization hardware not interacting well with the storage environment. Please know that we are doing everything within our power to fix this problem. We understand the severe impact these service problems have on students, faculty members and administrators. We have devoted every available resource, including those of our partners, to achieve a resolution.
This is our highest priority and we will not rest until it is corrected.
President & CEO
NOTE: Help keyboard image is public domain and found at Pixabay