After what has been the longest mail disruption in MIT's history, email service has been fully restored--without loss of saved or in-transit email--to the approximately 4,000 email subscribers affected by the crash of one of our central email servers. Although 80 percent of our community continued to have email service during the recovery period, the loss of service to the other 20 percent affected the entire Institute. Personally and on behalf of the entire staff of IS&T, I apologize and thank you for your patience and understanding. Our staff worked around the clock for several days to ensure the integrity of all email in the system, but we recognize that the length of this disruption was totally unacceptable. (Those interested in the technical details of the outage can find them on the 3DOWN services page.)
MIT's central email system is quite complex, delivering almost 4 million messages a day to our community of over 20,000 very active email users, as well as preventing the delivery of approximately 10 million spam messages a day. That this was the first general system failure since 2003 indicates both the reliability of our email system and its vulnerability to certain very rare, but not impossible, scenarios. We are currently in the middle of a project to provide redundant email servers, with the secondary system located well away from campus to shield us from still other types of problems (such as NSTAR power outages, one of which hit us last November). This project began in early 2006 and will be complete this summer at a cost of about $4 million. At the completion of this project, we will be able to recover much more quickly from many types of problems, further increasing our reliability.
IS&T is also making changes to reduce the number of users affected by any one server problem and to shorten the time needed to check system integrity in the wake of any disruption. We are investigating other ways of improving email service and of keeping the community informed of its status. In particular, we need to do a better job of reaching the people who are affected during an outage. In this age of spam, viruses, and other email-borne threats, we recognize how important it is for our community to maintain confidence in our central email service, and we will be working very hard to regain that confidence.
It is the job of IS&T to provide essential services to the MIT community of which we are all a part. The IS&T staff and I take that responsibility very seriously. We are constantly working to do better--to deliver highly reliable service, to recover from problems as quickly as possible, and to keep the community informed at all times. I apologize again for this service interruption. Email is the lifeblood of activity at MIT and we will do everything possible to make sure that it keeps flowing smoothly. As always, we appreciate your input on how we can improve.
--Jerry Grochow, Vice President for Information Services & Technology
A version of this article appeared in MIT Tech Talk on March 14, 2007.