Server disconnect problem solved?



InsaneJ
2nd January 2015, 16:12
We (well, actually, you guys) have been experiencing a lot of disconnects from the TFC server over the past few days. We may have found a fix for it. Warning: technobabble ahead!


We use Xen to run several virtual servers. It turns out that under certain workloads a bug in the xen-netfront driver is getting triggered. This causes packets to be dropped. When enough packets get dropped and the latency between a player and the server is high enough, players can get disconnected. It appears this can happen to any Linux kernel newer than 3.7. We have just upgraded the Linux kernel on our server to 3.17.7 which contains a fix for this issue.

Please let us know if you guys have a better connection with the server after this update.

Nexuni
3rd January 2015, 18:49
No fix yet it would seem...


Still getting time outs, mostly after more than 20 players join the server.


I would recommend lowering the number of possible players, but that would piss off a lot of people. :p

Heptagon_ru
3rd January 2015, 18:55
Two players blamed server broadcasts for the timeouts. I just checked: three timeouts were roughly 5:30 minutes apart, and two of them occurred right after a broadcast text appeared in chat. Perhaps the broadcast plugin glitches when the number of players is high, around 20?
And yes, constant timeouts today.

Nexuni
3rd January 2015, 20:51
At peak times it's like every 5 minutes...

I just tried to mine... I mined a few blocks -> time out -> half the blocks are reset...

Gets pretty annoying at times. ;)

Heptagon_ru
3rd January 2015, 21:29
It also seems that if I see the broadcast message on the screen, I don't get timed out; there is just some lag for about 10 seconds, and then the server continues to work, but some people (around 3-5) are disconnected.
When no message appears - a timeout.
Just a hypothesis. Don't have enough statistical data.

InsaneJ
3rd January 2015, 22:55
In the server logs I see the following happening:

[21:39:46] [Netty IO #3/WARN]: Selector.select() returned prematurely 512 times in a row; rebuilding selector.
[21:39:46] [Netty IO #3/INFO]: Migrated 2 channel(s) to the new Selector.
And then everybody disconnects. It seems that a workaround for a bug in Netty or Java (not sure which yet) activates to prevent a crash. That is good, but the result is that the connections are dropped and everybody has to reconnect, which obviously isn't good.
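To see how often this warning lines up with a mass disconnect, the log could be scanned for it. This is only a minimal sketch: the regular expression is inferred from the two log lines quoted above, and the function name `find_selector_rebuilds` is made up for illustration.

```python
import re

# Pattern inferred from the warning line quoted above (an assumption about
# the log format, not something taken from the Netty or Minecraft sources).
REBUILD = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\] \[Netty IO #(\d+)/WARN\]: "
    r"Selector\.select\(\) returned prematurely (\d+) times in a row; "
    r"rebuilding selector\."
)


def find_selector_rebuilds(lines):
    """Return (time, netty_thread, count) for each selector-rebuild warning."""
    hits = []
    for line in lines:
        match = REBUILD.match(line.strip())
        if match:
            hits.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return hits


# The two lines from the server log quoted above.
sample = [
    "[21:39:46] [Netty IO #3/WARN]: Selector.select() returned prematurely "
    "512 times in a row; rebuilding selector.",
    "[21:39:46] [Netty IO #3/INFO]: Migrated 2 channel(s) to the new Selector.",
]

for time, thread, count in find_selector_rebuilds(sample):
    print(time, "Netty IO #%d" % thread, count)
```

Comparing the times this returns with the times players report being kicked would show whether the selector rebuild is really behind every disconnect.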

md-5, the guy behind Spigot, says there is nothing he can do about it: https://github.com/SpigotMC/BungeeCord/issues/455

On Netty's GitHub people are discussing various Linux kernels, but I'm not sure if that is going to be any help to us since we're already on a newer kernel than the ones they are talking about: https://github.com/netty/netty/issues/2616

When we know more we'll post an update.

Nexuni
3rd January 2015, 23:36
I have no idea how to run those servers...

But does it have to be Linux? Couldn't it work with another OS?


As I said... no idea if that is even possible... :confused:

Jiro_89
4th January 2015, 07:05
Sorry about that second announcer. It was an attempt to have a more updated announcer plugin that could replace our last announcer that we removed due to the same issue of short pauses in gameplay. I'd venture to guess that any announcer using a similar system will cause the pauses so we may just have to have plain Jane announcements that don't have fancy colors, but it's better than nothing ;)

Nexuni
4th January 2015, 22:13
It gets pretty much unplayable if many people are online... Mostly from 16:00 to 24:00 GMT+1 (my timezone). It times out every 5 minutes...

Sverf
4th January 2015, 23:03
Around 22:00 CET I disabled TCP segmentation offloading and generic segmentation offloading in the virtual network driver of the VM instance. This should work around the problem, but it uses more CPU power to transmit packets. Let us know if this makes any difference.
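For anyone wanting to try the same thing, offloads like these are commonly toggled with `ethtool -K`. The sketch below only builds that command. The interface name `eth0` and the helper names are assumptions, and the post does not say which tool or interface was actually used on the server.

```python
import subprocess


def offload_off_command(interface, features=("tso", "gso")):
    """Build an `ethtool -K` command that switches the given offloads off."""
    command = ["ethtool", "-K", interface]
    for feature in features:
        command += [feature, "off"]
    return command


def disable_offloads(interface):
    """Run the command; needs root and ethtool installed (not called here)."""
    subprocess.run(offload_off_command(interface), check=True)


# "eth0" is a placeholder; the real interface name is not given in the thread.
print(" ".join(offload_off_command("eth0")))
```

A setting changed this way normally does not survive a reboot, so it would need to be reapplied at startup if it turns out to help.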

Jiro_89
5th January 2015, 02:32
The timeouts still seem to be occurring.

Jiro_89
6th January 2015, 21:17
Just an update on frequency: we've experienced 5 in the last 10 minutes. When the population reaches around 30 or more, the chance of a timeout seems to increase exponentially.

Sverf
7th January 2015, 10:41
It seems purely related to the number of players on the server: the more players, the more it happens.
However, the logs give no indication at all as to why. The Netty selector warnings do not appear before every mass DC, and with the tweaks to the Linux kernel the xen-netfront issue from the first post is not the cause either. I'm not sure where to look further at this time.

One last crazy idea I have is to replace the Netty class files bundled with the Minecraft jar with a newer version that apparently fixes the Netty select bug. No idea if this can be done, however.

Sverf
7th January 2015, 11:01
Browsing through last night's log (15:20 CET yesterday - 9:00 CET today), we had mass disconnects (more than 10 players DC'ing) at:

15:28:04
16:39:23
18:12:51
19:09:39
19:16:07
19:33:13
19:39:40
19:46:21
19:52:32
19:58:43
20:04:24
20:10:09
20:15:56
20:21:50
20:27:57
20:33:53
20:45:15
20:51:03
20:56:50
05:03:12
05:09:19
05:15:25
07:13:10
08:43:24

Between 19:00 and 21:00 CET it was particularly bad.
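The spacing between these disconnects is easier to judge as intervals than as raw timestamps. A small sketch that computes the gaps from the list above; the only assumption is that a timestamp earlier than the previous one means the log rolled over midnight.

```python
from datetime import datetime, timedelta

# Mass-disconnect times copied from the list above.
TIMES = """
15:28:04 16:39:23 18:12:51 19:09:39 19:16:07 19:33:13 19:39:40 19:46:21
19:52:32 19:58:43 20:04:24 20:10:09 20:15:56 20:21:50 20:27:57 20:33:53
20:45:15 20:51:03 20:56:50 05:03:12 05:09:19 05:15:25 07:13:10 08:43:24
""".split()


def gaps(times):
    """Return the timedelta between each pair of consecutive timestamps."""
    parsed = [datetime.strptime(t, "%H:%M:%S") for t in times]
    result = []
    for previous, current in zip(parsed, parsed[1:]):
        delta = current - previous
        if delta < timedelta(0):
            # Assumed midnight rollover (20:56:50 -> 05:03:12).
            delta += timedelta(days=1)
        result.append(delta)
    return result


for start, gap in zip(TIMES, gaps(TIMES)):
    print(start, "->", gap)
```

For the stretch from 19:33:13 through 20:33:53, every gap comes out between roughly 5 minutes 40 seconds and 6 minutes 40 seconds, which fits the "every 5 minutes" reports earlier in the thread.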

InsaneJ
7th January 2015, 11:53
We're still running Java 7 right?

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)

Perhaps we could try running Java 8 first? I ran a local instance of the server on my PC using Java 8. It starts up normally as far as I could tell.

Sverf
7th January 2015, 17:28
We can switch the default Java implementation per user account, so yeah, we sure can test the TFC server with Java 8.

InsaneJ
18th January 2015, 00:33
To try to keep things running smoothly we are going to enforce some restrictions. These restrictions are on a per-town basis, meaning each town may have:

4 mature animals of each kind. This means: 4 cows, 4 pigs, 4 horses, etc. All mature, not counting young animals.
1 plot of berry bushes. This means 16x16 = 256 bushes. They may be in more than one plot, but 256 max.
2 plots of tilled soil farmland. This means 512 blocks. They may be in more than two plots, but 512 max.
1 of each kind of fruit tree. (red apple, green apple, banana, peach, etc.)
1 of each kind of anvil. (copper, bronze, wrought iron, steel, etc.)

Again, to be absolutely clear: these limits are per town. If your town is very large and isn't getting enough food, please let us know. Also, if anyone needs help removing large amounts of tilled soil, fruit trees or berry bushes, you may contact staff to help you.
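For town owners who keep a tally, the limits above can be written down as a simple check. This is only an illustration: the numbers come from the list above, but the dictionary layout and the `check_town` function are made up and are not part of any server plugin.

```python
# Per-town limits taken from the list above.
LIMITS = {
    "mature_animals_per_kind": 4,
    "berry_bushes": 256,          # 1 plot of 16x16
    "tilled_soil": 512,           # 2 plots of 16x16
    "fruit_trees_per_kind": 1,
    "anvils_per_kind": 1,
}


def check_town(town):
    """Return a list of human-readable limit violations for one town.

    `town` is a hypothetical tally, e.g.
    {"mature_animals": {"cow": 4}, "berry_bushes": 100, "tilled_soil": 300,
     "fruit_trees": {"peach": 1}, "anvils": {"copper": 1}}
    """
    problems = []
    per_kind = [
        ("mature_animals", "mature_animals_per_kind"),
        ("fruit_trees", "fruit_trees_per_kind"),
        ("anvils", "anvils_per_kind"),
    ]
    for key, limit_name in per_kind:
        for kind, count in sorted(town.get(key, {}).items()):
            if count > LIMITS[limit_name]:
                problems.append(
                    "%s/%s: %d (max %d)" % (key, kind, count, LIMITS[limit_name])
                )
    for key in ("berry_bushes", "tilled_soil"):
        count = town.get(key, 0)
        if count > LIMITS[key]:
            problems.append("%s: %d (max %d)" % (key, count, LIMITS[key]))
    return problems


example = {
    "mature_animals": {"cow": 5, "pig": 4},
    "berry_bushes": 256,
    "tilled_soil": 600,
    "fruit_trees": {"peach": 1},
    "anvils": {"copper": 2},
}
for problem in check_town(example):
    print(problem)
```

Young animals are left out of the tally on purpose, since only mature animals count toward the limit.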

Now for some technical stuff explaining why and how we came up with the new restrictions.
We have made quite a lot of progress since the last update. We used Warmroast, a Java profiler aimed at Minecraft, to track down what is keeping the server occupied. It seems that during a mass disconnect the server is busy processing a lot of tile entities. The server has upward of 15.000 tile entities loaded during peak hours. In the screenshot below you can see that net.minecraft.world.WorldServer.func_72939_s() is using 82.79% CPU time during a mass disconnect on the server. It looks like it's going through a very large array and deleting items from it. If this takes too long (1200) it dies, which seems to be when the mass disconnect occurs.
http://happydiggers.net/attachment.php?attachmentid=1563&stc=1
Unfortunately this is a vanilla Minecraft function. This means that only Forge could potentially create an optimization for this to better deal with large numbers of tile entities. But even if they did, it would mean a new version of Forge, which we can't use since Cauldron, which our server runs on, isn't being maintained any longer. Cauldron is essentially Minecraft_server + Forge + Bukkit + magic glue code.

So that leaves us with two possible options. Perhaps the TFC devs could do something to optimize server performance, or we try to reduce the number of tile entities on the server. During peak hours we've seen that the majority of tile entities are made up of farmland and fruit trees. Below is an example of this:
http://happydiggers.net/attachment.php?attachmentid=1564&stc=1

Going through what the server is doing, we discovered that about 10.000.000 blocks are formed on the server each hour. That's about 2800 blocks each second (10.000.000 / 3600). These are most likely snow blocks and the like. However, each of these blocks forming meant one more record in the Prism database, which logs everything that happens on the server. We have changed the Prism configuration to no longer include these naturally forming blocks, which aren't necessary for grief prevention and roll-back. This saves us considerable database overhead.

We have installed a mod that optimizes performance. It's quite a scary mod to run since it changes a lot of the internal workings of the server. This is the summary:

Ticks multiple worlds at the same time in different threads
Ticks multiple entities and tile entities within worlds at the same time in different threads
Performs movement updates in network threads, so players will move smoothly even if TPS is bad
Performs chunk loading and chunk generation asynchronously - no lag spikes from players logging in or teleporting
Improved collision code which can handle thousands of mobs in a single block space without terrible TPS. Will still break clients.
Many more small tweaks to minecraft's internals and other mods, both to improve performance and ensure that multithreading doesn't break everything.