Is maintenance going on?

We have been getting service interruptions and mass disconnections yesterday and today, with users having a hard time reconnecting.

Is the team pushing patches through the system for upgrades or maintenance? If so, a warning message might help prevent confusion for users.

Unfortunately, we had some unexpected downtime both yesterday and today. We’ve figured out the cause and have put in a patch that should help. We’ve been monitoring it for hours and all seems well, but we’ll keep an eye out for issues.

None of this was planned. Quite an unhappy start to 2024… Apologies for the inconvenience.

Larger update just went live that should resolve this for good…

Back to the drawing board.

Poor Jesse… :) Thank you for everything you do…

Is it related to cyber attacks or other malicious activity?

Not an attack or abuse. The truth is all too boring but I’ll take some time and explain:

Previously, we thought it was due to a large number of users logging in to a chat at once. That was strange, as we’ve had chatrooms with tens of thousands of users all connecting around the same time on custom servers, and this chatroom had only a few thousand at once. Still, we added some algorithms to slow connections down if there’s a sudden surge, and that seemed to help for a few days.
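For illustration only: the post does not say which algorithm was added to slow connections during a surge. One common approach is a token bucket, sketched below in Python. The class name, parameters, and the injectable clock are all assumptions made for this example, not the actual implementation.

```python
import time


class TokenBucket:
    """Admit roughly `rate` new connections per second, with bursts up to `capacity`.

    Hypothetical sketch; not the chat service's real code.
    """

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock             # injectable so the logic can be tested
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # let this connection in
        return False      # caller would ask the client to retry shortly
```

A server would call `try_acquire()` once per incoming connection and delay or reject the connection when it returns `False`. As the rest of the explanation shows, this kind of throttle treats the symptom (many joins at once) rather than the per-join cost.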

However, when it happened again today despite our efforts to mitigate it, we took a minute before resetting everything to get a better snapshot of the problem, and figured out the cause: it’s not so much the number of users coming in that’s the issue, but rather the amount of data being read and written each time.

Effectively, our user list is stored in JSON, like:

{"2": {"n": "Jesse", "u": "Jesse", "a": "avatar_12345.png", "r": 4}, "11840": {"n": "brk", "u": "brk", "a": "avatar_98765.png", "r": 0}}

So if another user joins, their information (user ID, nickname, username, avatar/profile photo, rank) is added to the end. When they leave, this object is parsed, their entry is removed, and the result is stored back in Redis. The same goes for kicking, banning, changing nicknames, etc.
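The read-modify-write pattern described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual code: the `store` dict stands in for Redis, and the function names `join` and `leave` are hypothetical. Only the JSON shape (`n`, `u`, `a`, `r` keyed by user ID) comes from the example above.

```python
import json

# Stand-in for Redis: one key per chatroom, holding the whole user list as a JSON string.
store = {}


def join(room, user_id, nickname, username, avatar, rank):
    users = json.loads(store.get(room, "{}"))   # parse the entire list
    users[str(user_id)] = {"n": nickname, "u": username, "a": avatar, "r": rank}
    store[room] = json.dumps(users)             # re-serialize the entire list


def leave(room, user_id):
    users = json.loads(store.get(room, "{}"))   # parse the entire list again
    users.pop(str(user_id), None)               # remove just one entry
    store[room] = json.dumps(users)             # and write everything back
```

Note that every single join or leave touches the full list, no matter how small the change is.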

Normally, this is fine.

Unfortunately, users in the aforementioned chatroom come in with very long nicknames, much longer than average. Most chatrooms average about 5-15 characters per nickname in their user list; this one was many times that. And it seems the library we’re using to handle this was not up to the task. As it converted this larger-than-usual JSON to an object, updated the user information, then re-converted it to pass to Redis (our database that stores user info), it slowed down rapidly due to the JSON’s size. Add to that the fact that the chatroom had thousands of folks, meaning this parse/re-serialize cycle was happening many times per second, and the process really began slowing down. When this process slows down, nothing else can really happen. And when nothing else happens, folks tend to refresh, slowing it down all the more.
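To see why both list size and nickname length matter, here is a rough Python sketch that counts how many characters get parsed and re-serialized when users join one at a time. The function name and the placeholder field values are made up for this example, and it only counts data volume; it does not measure the real library’s speed.

```python
import json


def bytes_handled_for_joins(n_users: int, nickname_len: int) -> int:
    """Total characters parsed plus serialized while n_users join one by one."""
    blob = "{}"
    total = 0
    for uid in range(n_users):
        total += len(blob)              # the whole list is parsed on every join
        users = json.loads(blob)
        name = "x" * nickname_len       # placeholder nickname of the given length
        users[str(uid)] = {"n": name, "u": name, "a": "avatar.png", "r": 0}
        blob = json.dumps(users)
        total += len(blob)              # and the whole list is written back
    return total
```

Because each join reprocesses everything already in the list, the total work grows roughly with the square of the number of users, multiplied by the size of each entry. Longer nicknames make every one of those passes more expensive.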

In fact, earlier when things were rough, if you left the chatroom up for a minute or two, you eventually would get in. It was just very slow, unacceptably slow, as opposed to being outright down. But that’s a distinction without a difference, since users really shouldn’t need to wait a minute or two to sign in.

The good news is, as of writing (10:30 PM my time), we’ve gotten an all-new, much more efficient system working locally to handle users joining and leaving, as well as delivering the list in a way the frontend can parse. Now we must go through every other part of the code where the user list is modified (changing names/photos, changing ranks, nickname color, rank shapes, banning, kicking) and convert that to the new system as well, then do lots of testing to ensure nothing is amiss, then deploy it. We’re working through the night, as long as it takes, to get this done.
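The post does not describe how the new system works, so the following is only a guess at one possible shape, sketched in Python: store each user as a separate field (similar in spirit to a Redis hash with HSET/HDEL/HGETALL) so that a join, leave, or rename touches one small entry instead of the whole list. The `rooms` dict and all function names here are hypothetical.

```python
import json

# Stand-in for Redis hashes: room -> {user_id: small JSON string for that one user}.
rooms = {}


def hash_join(room, user_id, info):
    # Comparable to HSET: writes a single field.
    rooms.setdefault(room, {})[str(user_id)] = json.dumps(info)


def hash_leave(room, user_id):
    # Comparable to HDEL: removes a single field.
    rooms.get(room, {}).pop(str(user_id), None)


def hash_rename(room, user_id, nickname):
    # Only this user's small entry is parsed and rewritten.
    info = json.loads(rooms[room][str(user_id)])
    info["n"] = nickname
    rooms[room][str(user_id)] = json.dumps(info)


def user_list(room):
    # Comparable to HGETALL: build the full list only when the frontend needs it.
    return {uid: json.loads(raw) for uid, raw in rooms.get(room, {}).items()}
```

With a layout like this, the cost of each change depends on the size of one user’s entry rather than on the size of the whole room.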

Very sorry to all impacted for the inconvenience. We’re focused on fixing this for good. Stay tuned.

Update deployed. We’ll monitor the results over the coming hours/days/beyond.

So far, things are looking great with the fix. We’ve had another large turnout for the aforementioned customer’s chatroom today, with no downtime or stability issues.

We’ll continue working on additional optimizations, and will be deploying those over the next few days, to ensure something similar to this won’t happen in the future.

Thank you to everyone for your patience over the past few days.
