My website, www.lumacraft.com, was down from early morning through mid-afternoon today. I asked my service provider, Netfirms.ca, for an explanation, and for confirmation that corrective action would be taken to prevent recurrence. Here was their response:
"If being online is that critical for your site(s) - you may want to put some seriuos (sic) thought into upgrade your hosting up to a vps (virtual private server). VPS's are more stable than shared hosting but are also self managed (you would be doing all of the maintenance on the server)."
I sent this reply:
"Uptime is indeed very important to me, but I have to balance cost and benefit, and mine is a small business. From what I can see, VPS service costs many times more than what I am currently paying.
In my five years with Netfirms, I have suffered only two or three outages of today's duration.
However, I have noticed more outages during the past few months than I had previously. (All of the other outages during the past few months have been much shorter than today's.)
If Netfirms' uptime guarantee is 99.9%, I believe that works out to about 9 hours of downtime per year. I can live with that level of outage in a 365-day period, based on what I am paying. But since we have exceeded 9 hours today alone, I hope that you can understand my concern. My questions regarding cause and corrective action were meant to be an opportunity for Netfirms to restore some confidence."
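As a rough check on the arithmetic in that reply, the downtime that an uptime guarantee permits can be computed directly. This is only a sketch: the function name is my own, and it assumes the guarantee is measured over a 365-day year, which may not match how Netfirms actually defines it.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime that an uptime guarantee permits over a period.

    Hypothetical helper for illustration; a real guarantee may be
    measured monthly and may exclude scheduled maintenance.
    """
    return period_hours * (100.0 - uptime_percent) / 100.0


# A 99.9% guarantee leaves 0.1% of the year as permitted downtime.
print(round(allowed_downtime_hours(99.9), 2))
```

Under these assumptions the allowance comes out a little under nine hours per year, so "9 hours" is a fair round figure.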
I have yet to receive any further information from Netfirms. If and when I do, I will update this posting.
UPDATE 10-Feb-2013: A support technician at Netfirms responded:
“Unfortunately we do not have logs of what could have happened to your pointer. However if for whatever reason this does happen again, it is a quick fix that you could do through navigating to Domain Central from your control panel. In domain central simply click on your domain to get the boxes below it and select pointers, and make sure that your domain is pointing to the correct subdirectory.”
That did address my questions regarding cause and corrective action, and so I thanked Netfirms for it. But these are not the sorts of answers that I was looking for.
From a nuts-and-bolts point of view, it’s helpful to know how to fix that pointer. But even if I had not already known how to use the control panel, which I did, the answer fails to address the root of the problem that I experienced. The problem had nothing to do with control panel operation. The problem was that somehow that pointer broke! My site was working fine until about 6:30AM yesterday. And then it was down for nine hours. Why? What is in place to prevent it from happening again?
The really troubling piece of information in Netfirms’ reply is that they do not maintain logs that would help them to diagnose the cause of a nine-hour site failure like the one that I experienced yesterday. To quote a saying often attributed to Lord Kelvin, “If you can not measure it, you can not improve it.” In this case, the question is not whether Netfirms can measure it. Of course they can. The question is why in heaven’s name they are not.
UPDATE #2 10-Feb-2013: In parallel with the dialogue that I opened with Netfirms Technical Support, I had contacted Netfirms Customer Support for details on Netfirms’ uptime guarantee. That thread is getting interesting too. Here is my latest update to it:
"Thank you for your interest in my concern, (rep's name).
When I switched to Netfirms for web hosting service five years ago, Netfirms promoted a 99.9% uptime guarantee. Based on your colleague's response earlier in this thread, apparently, without my awareness or knowing consent, Netfirms has reduced its reliability commitment to 1/10 of that level, to just 99% uptime, while at the same time, the fees that I have paid have remained essentially unchanged.
Yesterday my site, lumacraft.com, was down for over nine hours. It went down sometime before 6:30AM EST, and service was not restored until about 4:00PM EST. With a 99.9% uptime guarantee, that would be more downtime than I should expect in a year. However, if the reliability level that I can expect is in fact only 99%, then yesterday's downtime may be only a drop in an annual 88 hour bucket of downtime.
Here is a log of other website service outages that lumacraft.com has recently experienced:
January 9, 2013 06:19
December 31, 2012 00:34
December 15, 2012 17:27
November 22, 2012 15:21
October 16, 2012 12:43
All of the above outages were under an hour, and unlike yesterday's outage, in each case service was restored without my intervention.
When I asked Netfirms' tech support about the cause of yesterday's outage (ticket 106xxxxx), the response was first, that I should consider VPS service, and later, "Unfortunately we do not have logs of what could have happened to your pointer." That remark was particularly troubling to me. How can I expect reliable service from Netfirms, if Netfirms cannot and does not routinely monitor and investigate lengthy client service outages, such as the one that I experienced yesterday?
To summarize, here are my questions:
1. Has Netfirms' stated uptime commitment for the web hosting service that I subscribe to dropped from 99.9% to 99% since I moved my site to Netfirms five years ago? If so, why did Netfirms decide to reduce its reliability commitment to me by a factor of 10?
2. I have noticed six service outages in the past four months, culminating in yesterday's nine-hour outage. Is this an acceptable level of reliability, in Netfirms’ opinion? If it is not, what is the root cause of the problem, and what steps is Netfirms taking to address the root cause? (The above-mentioned tech initially responded to similar questions by advising me that the pointer from my domain name to the correct folder on the server had been lost, along with some advice on DIY control panel operation. While informative, that answer avoids the only questions that really matter to me, which are why did the pointer break in the first place, and why should I be confident that it won't happen again.)
I look forward to your reply.
David A. Gilmour
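For anyone who wants to check the figures in that letter, here is a small sketch that compares the outage against a year's downtime allowance at each guarantee level. The start and end times come from the letter; the calendar date is inferred from the date of the update, and the helper name and the 365-day measurement period are my own assumptions rather than anything from Netfirms' terms.

```python
from datetime import datetime

HOURS_PER_YEAR = 365 * 24  # assumes the guarantee is measured over a 365-day year

# Outage window as described in the letter. The site went down "sometime
# before 6:30AM", so this is a lower bound on the real duration.
went_down = datetime(2013, 2, 9, 6, 30)   # date inferred from the 10-Feb-2013 update
restored = datetime(2013, 2, 9, 16, 0)
outage_hours = (restored - went_down).total_seconds() / 3600


def share_of_annual_allowance(outage_hours, uptime_percent):
    """Fraction of a year's permitted downtime used by one outage (hypothetical helper)."""
    allowance = HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0
    return outage_hours / allowance


for pct in (99.9, 99.0):
    share = share_of_annual_allowance(outage_hours, pct)
    print(f"{pct}% guarantee: outage used {share:.0%} of a year's allowance")
```

Under these assumptions, the single outage uses more than a full year's allowance at 99.9%, and roughly a tenth of it at 99%.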