As many of you experienced, CoursePlus was incredibly slow to the point of being unusable on Sunday, July 24, from approximately 5pm until 1am on Monday, July 25. While there were periods during this time when the site responded normally and some people had no problems, "slow and painful" was the most common experience during this time. This isn't normal, and is unacceptable to both you and us. I'd like to explain what happened and how we are addressing the problem.
In working with JHSPH IT on the cause of the problem, we found that there was a cascade of database deadlocks on the database tables for the peer assessment system in CoursePlus. These deadlocks prevented data from being written to the tables. As write requests to those tables queued up, they backed up write requests to the database cluster as a whole. The growing backlog kept the database from releasing resources across the cluster, which caused still more of the deadlocks at the heart of the problem. This, in turn, caused CoursePlus to slow down to the point of being unusable for long periods of time. As people gave up on accessing CoursePlus because of the slowness, things would speed up again for a bit, only to slow down again when the database deadlocking resumed.
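For anyone curious about what a deadlock actually is: it happens when two or more database requests each hold something that another one needs, so none of them can finish. The short Python sketch below illustrates the idea only. It is not CoursePlus code, and the function name `find_deadlock` and the request names are made up for this example.

```python
from typing import Dict, List, Optional


def find_deadlock(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return a list of requests that are stuck waiting on each other, or None.

    `waits_for` maps a blocked request to the request that currently holds
    the lock it needs. A deadlock is a cycle in that mapping.
    """
    for start in waits_for:
        path = [start]
        seen = {start}
        current = start
        # Follow the chain of "who is this request waiting on?"
        while current in waits_for:
            current = waits_for[current]
            if current == start:
                # We came back to where we began: everyone in `path` is stuck.
                return path
            if current in seen:
                # A cycle exists but does not include `start`;
                # it will be found when the loop starts from one of its members.
                break
            seen.add(current)
            path.append(current)
    return None


# Hypothetical example: two writes that each wait on the other.
stuck = find_deadlock({"write_A": "write_B", "write_B": "write_A"})
# A simple queue (A waits on B, B is free to finish) is not a deadlock.
not_stuck = find_deadlock({"write_A": "write_B"})
```

In the first case neither write can ever complete on its own, so anything queued behind them waits too, which is the kind of pile-up described above.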
(As a side note, the brief period where CoursePlus went unresponsive today, July 25, was caused by over-eager database server monitoring while investigating this problem. That period lasted from 11:45am until 11:57am.)
After performing the forensic analysis which revealed the source of this problem, we've done the following:
- Retuned the underlying database design to improve data lookup performance
- Removed some locking queries which could potentially result in database deadlocks
We've already put these changes into production and are monitoring the database to look for any sign of continued problems. We're optimistic that these changes will make a big difference in preventing a recurrence of the site-wide problems we had last night.
In addition to thanking all of you for your patience, I specifically want to thank all of you who contacted CTL Help, sometimes with great anger and sometimes with great humor, to let us know about the specifics of your problems. The detail you provide to the CTL Help team is invaluable, and contacting CTL Help is always a better and faster way of getting problems with CoursePlus solved than contacting your course faculty or TA.
We know that this event was incredibly frustrating. We know that it eroded the trust that you have that CoursePlus will record your peer assessment or quiz answers correctly. Maintaining that trust is paramount to us.
We'll keep monitoring the database and will continue to make refinements. As always, please contact CTL Help if you have any ongoing or new issues with CoursePlus!
Posted by Brian Klaas at 3:18 PM - Categories: CoursePlus