Incident Report for March 20 Downtime

At 7:40pm NZT today, all Path of Exile service was disrupted. While downtime does happen from time to time, this level of disruption is not acceptable.

At 7:45pm NZT, our sysadmin Thomas was alerted by our automated monitoring tools that the realm was experiencing issues. Unfortunately, he was away from a computer at the time and was unable to respond immediately. Normally, Thomas and I are both on call for server incidents so that one of us is always available. Unfortunately, I am currently in the US at the Game Developers Conference and thus did not get server notifications.

At 8:00pm NZT, we were notified by our support team that the realm was experiencing issues and that Thomas had not responded. I began the process of attempting to diagnose the issue. According to the logs, all game instances on every server were crashing on startup.

At 8:11pm NZT, I made the call to attempt a realm restart to see if it would resolve the problem. This restart completed at 8:15pm NZT, but did not resolve the issue. I continued to investigate.

At 8:23pm NZT, I finally discovered that the problem was a malformed spam list update that had been pushed to production. A mistake was made in the formatting of the file and this crashed the server when the file was loaded.

By 8:27pm NZT, we had made the needed changes and began pushing them to production.

At 8:30pm NZT, the update had been pushed and the problem was resolved.

While this incident was caused by human error in formatting a file, it should not have been able to occur in the first place, and our response should not have taken as long as it did. I will outline the steps that we will take to prevent an incident like this from occurring again.

The first and most obvious problem is that our response time on this incident was too slow. This is because I was away and Thomas happened to be unavailable. When I was able to respond, I was doing so on a hotel internet connection in the US tunnelling through our office in NZ and then back to the US. Not the best situation for debugging problems. Fortunately, we have recently hired a new system administrator who will be joining us in a few weeks. This will mean that we can have at least two server admins on call at all times, even when I happen to be travelling.

The second problem is that a file was put into production without first being test loaded. This is a process problem that should not have occurred. Normally, any file that goes on production is test loaded to verify that it doesn't cause any issues. Updating the spam list is a feature that was implemented outside our normal asset testing pipeline and so didn't receive the rigorous testing that we normally do on assets. From now on, spam list updates will be tested before being deployed.
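
As an illustration only, a test load step of this kind could look roughly like the sketch below. The actual spam list format and tooling are not described in this report, so the one-pattern-per-line format, the `#` comment convention, and the function names are all assumptions made for the example.

```python
import re


def validate_spam_list(text):
    """Test load a spam list and report problems instead of crashing.

    Assumed (hypothetical) format: one regular expression per line,
    blank lines and lines starting with '#' are ignored.
    Returns a list of (line_number, message); empty means it loaded cleanly.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        try:
            re.compile(entry)
        except re.error as exc:
            problems.append((number, f"invalid pattern: {exc}"))
    return problems


def safe_to_deploy(text):
    """Gate for a deploy script: only push files that test load cleanly."""
    return not validate_spam_list(text)
```

A deploy script could call `safe_to_deploy` on the new file and refuse to push it when the result is false, printing the reported line numbers so the formatting mistake can be corrected first.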

The third fix is for the actual crash in the spam list loader that caused the problem. A malformed file should not be able to crash the server.
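
One common way to make a loader like this robust, sketched here under assumptions, is to parse the whole file first and only swap it in if every entry is valid, keeping the previous list otherwise. The `SpamFilter` class, its methods, and the one-regex-per-line format are hypothetical and are not taken from the actual server code.

```python
import re


class SpamFilter:
    """Hypothetical spam filter that survives a malformed list update."""

    def __init__(self):
        self.patterns = []

    def reload(self, text):
        """Replace the active list only if every entry parses.

        Returns True on success. On a malformed file the previous
        list stays active and False is returned instead of raising.
        """
        try:
            new_patterns = [
                re.compile(line.strip())
                for line in text.splitlines()
                if line.strip()
            ]
        except re.error:
            return False
        self.patterns = new_patterns
        return True

    def is_spam(self, message):
        return any(p.search(message) for p in self.patterns)
```

With this shape, a bad update is rejected and can be logged for follow-up, while the realm keeps running on the last list that loaded correctly.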

I am very disappointed in myself that this incident was allowed to occur. We didn't have the staff required to cover for my absence, and I would like to apologise to all our players for the inconvenience caused. This is not the level of service you should expect and the process changes we are making will prevent an incident like this reoccurring in the future.
Path of Exile II - Game Director
Last edited by Jonathan#0000 on Mar 20, 2014, 4:36:19 AM
PSt...the date in the title is wrong :P

Thanks for the report though...much appreciated
Ancestral Bond. It's a thing that does stuff. -Vipermagi

He who controls the pants controls the galaxy. - Rick & Morty S3E1
Jonathan really you don't need to apologize n.n
Dys an sohm
Rohs an kyn
Sahl djahs afah
Mah morn narr
"
lagwin1980 wrote:
PSt...the date in the title is wrong :P

Thanks for the report though...much appreciated


All times and dates are in NZ time by our convention.

Edit: But I got the month wrong. Duurrr.
Path of Exile II - Game Director
Last edited by Jonathan#0000 on Mar 20, 2014, 4:36:41 AM
February is the new March!
I'll make this short and simple,

Jonathan, Shit happens, and you guys still have unfathomable uptime on the servers so it's ok! Don't be so hard on yourself!

I'm sure all PoE players agree with my statement.
Twitch.tv/Nithryok
TL;DR : File was Vaal Orb'd and OP regrets it.

RNG is RNG

Spoiler
Seriously no big deal, I'm at work anyways. Thanks for the info Jonathan. However if you still feel the need to make this up to me add an .ini I can edit outside of the game files for loot colors and loot filters. ;)


Also, my internet that I pay $55 a month for goes down more than POE does!
Last edited by Worldbreaker#6569 on Mar 20, 2014, 4:52:49 AM
"
Nithryok wrote:
I'll make this short and simple,

Jonathan, Shit happens, and you guys still have unfathomable uptime on the servers so it's ok! Don't be so hard on yourself!

I'm sure all PoE players agree with my statement.


Most people with sense do agree with that statement....unfortunately there are still plenty that feel that they are owed an apology.
Ancestral Bond. It's a thing that does stuff. -Vipermagi

He who controls the pants controls the galaxy. - Rick & Morty S3E1
At some point, when PoE goes a month without any kind of downtime, I want to see a new thread where Jonathan boasts about the continued uptime of the servers and takes responsibility for the success of a major accomplishment. Because it shouldn't be that, every time a certain GGG member creates a new thread in GD, you kind of worry if he's about to wander off and commit seppuku.

Seriously, Jonathan. It's okay.
When Stephen Colbert was killed by HYDRA's Project Insight in 2014, the comedy world lost a hero. Since his life model decoy isn't up to the task, please do not mistake my performance as political discussion. I'm just doing what Steve would have wanted.
Last edited by ScrotieMcB#2697 on Mar 20, 2014, 4:50:03 AM
I love these post-mortem reports! The speed of the reports and the frankness about the problem and the solution are always refreshing to see.
