  1. #1
    Cake! InsaneJ's Avatar
    Join Date
    Jan 2012
    Location
    Cakeville
    Posts
    4,764
    Blog Entries
    21

    Server hardware stuff

    In this thread I'll try to keep track of what goes on with the server hardware wise.

    The last publicly documented server change was this: [completed] New server plans (These are no longer 'new' plans)
    We upgraded the server to a 6-core/12-thread Intel Core i7 5820K CPU with 64GB of RAM.

    Unfortunately that setup gave us trouble running VMware ESXi, which resulted in an unstable server. It took a few weeks to track down the exact cause: the CPU. If you ever do any PC building: it's almost never the CPU that causes stability issues, unless you're overclocking, which wasn't the case here.

    I decided to upgrade the CPU to a 14-core/28-thread Intel Xeon E5 2680 v4 CPU with 96GB RAM after that. This is what we are currently running all our servers on. That took care of the stability issues.

    Then the next issue was server performance. Or to be more precise: disk performance. The servers are running off two 7200 rpm SATA drives. Even though they are connected to an Areca 1680i raid controller with 4GB cache and a dual-core 1.2GHz PowerPC CPU, that's not enough for all the additional servers we're now running. It used to be:
    • website
    • email
    • Minecraft, 10 or so instances.

    And now we added:
    • ARK Survival Evolved, 3 modded instances.

    Disk I/O was lagging behind and that caused some noticeable performance issues.

    So I purchased an Icy Dock Tough Armor 4 x 2.5" mobile rack for 1 x 5.25" device bay. In this dock, I have placed three second-hand 300GB 10K rpm SAS drives. These disks are meant to offload some of the disk I/O that's been hammering the OS drives. I also swapped out the four 80GB Intel Postville SSDs and replaced those with two new Samsung 850 Pro 256GB SSDs.

    The additional drives worked well for a few weeks. Unfortunately this morning at around 5:02AM one of those three drives failed. This is not a big deal since they were running in raid-5, which keeps working with one drive missing. However it does mean I now have to move the virtual machines that were running on those drives back to the other disks they were on before. Which means we may experience some slowdowns in the time to come.
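For anyone wondering why a single failed drive in raid-5 isn't fatal: the array stores parity alongside the data, and a missing block can be recomputed from the surviving ones. Below is a minimal sketch of that idea using XOR parity. The block contents and function name are made up for illustration, and a real controller like the Areca works on disk stripes and rotates parity across drives, which this does not model.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: two data blocks plus one parity block on three drives.
data_a = b"minecraft"
data_b = b"ark-world"
parity = xor_blocks([data_a, data_b])

# Simulate losing the drive that held data_b: rebuild it from the survivors.
rebuilt_b = xor_blocks([data_a, parity])
```

The same XOR that produced the parity block also recovers the lost block, which is roughly what the controller does across the whole array during a rebuild.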

    The 10K rpm SAS drives were second-hand with no warranty, so the faulty drive will have to be replaced by buying another. We just had a beautiful baby girl and with all the stuff we need for that I'm not allowed (haha) to spend more money on my hobby. So if anyone wants to help out, you can donate (see the front page). Those SAS drives cost about $55 each:
    HP 300GB 6G SAS 10K 2.5 inch

    At any rate the servers will continue to run. The ARK servers are running on two Samsung 850 PRO SSDs. It's just everything else that will get a performance hit now that they have to share the slower storage.

  2. #2
    Would this lower tps on the freebietfc server? It was around 18 this morning with 2 on, and around 14 now with 4 on.
    If it is, I will let people know if they ask about it. And this is in no way a push to get it fixed, family comes first.

  3. #3
    Cake! InsaneJ's Avatar
    I'm not sure if TPS will drop when loading chunks takes a bit longer. It might.

    As for fixing the issue, that won't take much time. If/when I have the funds it's just a matter of ordering a drive, pulling the defective drive from the server and replacing it with the new one. The raid controller will then start rebuilding the array automatically. After that is done I'll move the virtual machines back to the SAS drives. All in all it won't take more than half an hour of button pushing. The rebuilding and virtual machine moving will take longer but that's just a progress bar filling up.

  4. #4
    Is there anything I can do to check what may be causing the tps drop?
    I used /lag and the entities were around 4000, below the 10k number I have seen on the message.
    Also at one point mem had 646 or so remaining.

    /lag
    now shows
    2065 chunks, 990 entities, 126,537 tiles. It looks like it reset about 3 hours ago and is back to 19.87 tps,
    though the one on the tab screen shows a different number of about 4. Is that the Bukkit tps?
    Last edited by Rainnmannx; 27th March 2017 at 01:12.

  5. #5
    Cake! InsaneJ's Avatar
    /lag shows TPS from the Bukkit side of the server, same as the value shown when pressing TAB. Use /forge tps instead to get a more accurate reading on how the server is doing. Bukkit TPS tends to always be slightly lower than the Forge TPS and less than 20 (19.xx). To me it looks like one sits on top of the other since Forge and Forge mods seem to take precedence over Bukkit and Bukkit plugins.

    The TAB TPS being low was due to a glitch in BungeeCord and the plugin that takes care of that information. I've restarted BungeeCord and now the 'Bukkit TPS' in the TAB screen is similar to what you get when you type /lag or /tps.

    [Attachment: 2017-03-27_10.53.00.png]

    [Attachment: 2017-03-27_10.54.13.png]

    Bottom pic shows output of:
    /lag
    /tps
    /forge tps
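As background for reading those numbers: a Minecraft server aims for 20 ticks per second, so each tick has a 50 ms budget, and TPS drops once the average tick takes longer than that. Here is a minimal sketch of that relationship. The function name is made up, and how /lag, /tps and /forge tps each average their samples isn't shown in this thread, so treat this as the general idea only.

```python
TARGET_TPS = 20.0  # Minecraft aims for 20 ticks per second (50 ms per tick)


def tps_from_tick_ms(mean_tick_ms):
    """Convert a mean tick duration in milliseconds to ticks per second, capped at 20."""
    if mean_tick_ms <= 0:
        return TARGET_TPS
    return min(TARGET_TPS, 1000.0 / mean_tick_ms)


# A tick that fits the 50 ms budget keeps the server at the cap;
# a 100 ms tick halves the rate.
healthy = tps_from_tick_ms(25.0)
struggling = tps_from_tick_ms(100.0)
```

So a reading like 14 TPS would correspond to an average tick of roughly 71 ms under this model.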

  6. #6
    Cake! InsaneJ's Avatar
    I received three donations last night. Thanks guys!
    With this I'm going to purchase two new drives. Unfortunately the drives I linked above won't ship to The Netherlands. Buying them here (European Amazon) they are 89.98 euro. Which means to avoid an angry wife we could really use another donation or two.

    The two drives will replace the faulty one and expand the array to give us a net storage of 900GB with more I/O. (Raid-5 usable capacity = n-1 drives, meaning: 4x300 - 300) The drives should arrive in a few days.
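The n-1 rule is easy to check for both the old and the new array. A small sketch, with a made-up function name, assuming all drives in the array are the same size:

```python
def raid5_usable_gb(drive_count, drive_size_gb):
    """Usable raid-5 capacity: one drive's worth of space goes to parity."""
    if drive_count < 3:
        raise ValueError("raid-5 needs at least three drives")
    return (drive_count - 1) * drive_size_gb


old_array = raid5_usable_gb(3, 300)  # the original three 300GB SAS drives
new_array = raid5_usable_gb(4, 300)  # after adding a fourth drive
```

With three 300GB drives that comes to 600GB, and with four it comes to 900GB.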

  7. #7
    Cake! InsaneJ's Avatar
    The new hard drives were delivered today. I put them in the server and, although this is something for which the server can remain on, it went down anyway. Reason being my 2-year-old, who saw a shiny power button and just had to push it. I still haven't figured out how I can configure ESXi to ignore the power button. So... apologies for the unscheduled downtime.

    The new drives have been put in a raid-5 array and it's currently initializing. We're growing from 600 to 900GB (4x300 - 300) and when that's done I'll start moving virtual machines back to this array. After that server performance should be back to normal.

  8. #8
    Maybe you can borrow the dance dance authentication from stackoverflow

  9. #9
    Cake! InsaneJ's Avatar
    As some of you have noticed the server still feels sluggish from time to time. I think I may have found the culprit. It's Dynmap! We have Dynmap running on our TFC servers, 4 in total. Those generate a ton of tile updates because of TFC and its huge number of block updates.

    Take a look at this:
    [Attachment: Dynmap Disk IO.png]

    What you see here are statistics for our WD Purple hard drive which is solely used to store Dynmap tiles. It does nothing else. There's no raid, just a single drive. As you can see the system is sending up to 1600 IOPS (input/output operations per second) to that drive. As a rule of thumb a regular hard drive can only do about 100 IOPS.
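The 100 IOPS rule of thumb follows from how long a spinning disk needs per random access: the head has to seek, and then the platter has to rotate to the right sector (half a turn on average). A rough sketch of that estimate is below. The seek times are assumed example values, not specs for the WD Purple or the SAS drives, and it ignores caching and command queuing.

```python
def estimate_hdd_iops(rpm, avg_seek_ms):
    """Rough random IOPS for a spinning disk: 1 / (seek time + half a rotation)."""
    rotational_latency_ms = 0.5 * 60_000.0 / rpm  # half a revolution, in ms
    return 1000.0 / (avg_seek_ms + rotational_latency_ms)


# Assumed seek times for illustration only.
sata_7200 = estimate_hdd_iops(rpm=7200, avg_seek_ms=8.5)
sas_10k = estimate_hdd_iops(rpm=10_000, avg_seek_ms=4.0)
```

With these assumed numbers a 7200 rpm drive lands somewhat below 100 IOPS and a 10K rpm drive somewhat above it, either way far short of the 1600 IOPS being requested.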

    The reason this is slowing down the rest of the server is that, while it is a single drive, it is connected to the same raid controller as the rest of the hard drives. When a drive can't keep up with the requested amount of IOPS, a bunch of these get queued. The raid controller has a queue depth of 255. When that queue gets saturated, IOPS meant for other raid arrays are put back in line and need to compete on a flooded I/O path.
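To get a feel for how quickly that queue fills up, here is a back-of-the-envelope sketch. It assumes requests arrive and complete at steady rates, which real workloads don't do, and the function name is made up. The numbers are the ones from this post: 1600 IOPS requested, roughly 100 IOPS served, queue depth 255.

```python
def seconds_until_queue_full(arrival_iops, service_iops, queue_depth):
    """Time for a backlog to reach queue_depth, or None if the drive keeps up."""
    backlog_growth = arrival_iops - service_iops  # requests piling up per second
    if backlog_growth <= 0:
        return None
    return queue_depth / backlog_growth


fill_time = seconds_until_queue_full(arrival_iops=1600, service_iops=100, queue_depth=255)
```

Under these assumptions the queue is full in well under a second, so at that request rate it would be saturated pretty much all the time.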

    What I'm going to do is move the WD Purple drive to the onboard SATA controller. Since it doesn't use raid anyway, Dynmap can saturate that SATA controller all it wants to. If I'm right about this we should see lower latency for all other raid volumes meaning everything should run smoother.

    I'm probably going to do this sometime tonight. So if the server's down, that's why.

  10. #10
    Cake! InsaneJ's Avatar
    As it turns out, the onboard SATA controller isn't supported by VMware ESXi. So instead I did the next best thing: I limited the maximum amount of IOPS the virtual machine could send to the WD Purple drive. This resulted in the graphs below:
    [Attachment: Dynmap Disk IO after rate limiting.png]

    So it respects the hard limit of 100 IOPS. And as predicted the latency for all the other raid arrays has gone down significantly, making everything feel snappy again.
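To illustrate what a hard IOPS limit does to a flood of requests, here is a small token-bucket sketch. This is a common, generic rate-limiting technique and not a description of how ESXi implements its limit, which isn't covered in this thread. The class name, the burst size and the simulated clock are all made up for the example.

```python
class IopsLimiter:
    """Token bucket: admits roughly `rate` operations per second."""

    def __init__(self, rate, burst=2.0):
        self.rate = rate      # tokens (I/O operations) added per second
        self.burst = burst    # maximum tokens that can accumulate while idle
        self.tokens = 1.0     # start with room for a single operation
        self.last = 0.0       # timestamp of the previous call, in seconds

    def allow(self, now):
        """Return True if one I/O may be issued at time `now` (seconds)."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Offer 1600 requests spread evenly over one simulated second.
limiter = IopsLimiter(rate=100)
admitted = sum(limiter.allow(i / 1600) for i in range(1600))
```

Out of the 1600 requests offered in that simulated second, only about 100 get through. In this sketch the rest are simply refused, whereas in the real setup they would wait their turn, which fits with Dynmap tiles taking a while to load.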

    Now because the drive is being limited to 100 IOPS, which is still a crazy amount, Dynmap for the TFC servers may feel a bit sluggish. Sometimes it takes a while before it loads the tiles. Just give it a moment and it'll eventually load everything.

    Next up on our agenda is trying to figure out why exactly Dynmap is generating such a crazy amount of IOPS. It's only doing a few KB/s. So I'm not sure what's going on there. Sure TFC does a huge amount of updates, but even this is far beyond what I'd expect.
