
Server Help [CRASH BUG] Unreleased Socket Files / Too Many Open Files

Discussion in 'Multiplayer' started by Seriallos, Dec 17, 2013.

  1. Seriallos

    Seriallos Space Penguin Leader

    Looks like the Starbound linux64 server is not properly releasing socket file descriptors. Every time a client connects (or any socket connection is made), a file descriptor is opened but is never released, even on a successful/clean socket close. Eventually the server will hit the file limit and stop responding.

    On Linux, it's pretty easy to see.

    Code:
    sudo lsof | grep starbound | grep sock | wc -l
    Connect to the server and you can see the number of sockets increment by 1. While the actual socket is closed (as can be verified with netstat), the FD is never released.

    I'm not a C++ programmer, but the following Stack Exchange question shows the same type of behavior and it ended up being a socket leak: http://stackoverflow.com/questions/12820976/close-on-socket-not-releasing-file-descriptor
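    For anyone who wants to see the effect in isolation, here is a minimal sketch (not taken from the Starbound code, which none of us can see). It counts a process's own descriptors via /proc/self/fd, so it is Linux-only, and the helper names count_open_fds and leaked_after are made up for this illustration. A socket that is opened and closed leaves the count unchanged; one that is opened and forgotten raises it by one.

```c
#include <dirent.h>
#include <sys/socket.h>
#include <unistd.h>

/* Count this process's open file descriptors by listing /proc/self/fd
   (Linux only). Returns -1 if the directory cannot be opened. */
int count_open_fds(void) {
    DIR *dir = opendir("/proc/self/fd");
    struct dirent *entry;
    int count = 0;

    if (!dir) return -1;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] != '.') count++;  /* skip "." and ".." */
    }
    closedir(dir);

    /* The directory stream itself occupied one descriptor while counting. */
    return count - 1;
}

/* Open n TCP sockets and report how many descriptors are left over.
   With do_close set, each socket is closed again and nothing leaks. */
int leaked_after(int n, int do_close) {
    int before = count_open_fds();
    int i;

    for (i = 0; i < n; i++) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return -1;
        if (do_close) close(fd);
    }
    return count_open_fds() - before;
}
```

    leaked_after(5, 1) should come back as 0 and leaked_after(5, 0) as 5, which is the same growth the lsof count shows for the server.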
     
    class101 likes this.
  2. furrycat

    furrycat Aquatic Astronaut

    strace reveals that the server is needlessly calling socket after every connection, causing the leak exactly as you describe.

    I seem to be getting into the habit of writing very ugly LD_PRELOAD hacks...

    This one allows the first call to socket to succeed but then returns the same file descriptor for subsequent calls. That would normally be very bad indeed but since the subsequent calls aren't supposed to be made and serve no purpose, we can maybe get away with it.

    I say maybe because beyond running the server and verifying that one client, ie me, can connect, I haven't done any testing.


    As usual, create starbound.c in the linux64 directory.

    Edit: The code has been superseded by the post below.
    Code:
    #define _GNU_SOURCE
    
    
    #include <sys/socket.h>
    #include <dlfcn.h>
    #include <stdlib.h>
    
    static int reuse = -1;
    
    int socket(int domain, int type, int protocol) {
      static int (*socket_orig)(int, int, int) = NULL;
      int ret;
    
      if (! socket_orig) socket_orig = dlsym(RTLD_NEXT, "socket");
    
      if (type == SOCK_STREAM) {
        if (reuse > -1) return reuse;
      }
    
      ret = socket_orig(domain, type, protocol);
      if (type == SOCK_STREAM) reuse = ret;
    
      return ret;
    }
    Compile with
    Code:
    gcc -fPIC -shared -o starbound.so starbound.c -ldl
    Then run the server with
    Code:
    LD_PRELOAD=$PWD/starbound.so LD_LIBRARY_PATH=$PWD ./starbound_server
     
    Last edited: Dec 23, 2013
    acmintz469, class101 and Seriallos like this.
  3. Seriallos

    Seriallos Space Penguin Leader

    Wow, nice work. I may try this on my server to see if it works for multiple users until there's a real fix.

    You, sir, get an internet point.
     
  4. //EDIT: Tested starbound.so and it works. The server now reuses the socket instead of creating a new one each time. Best thread I have seen on the forum so far: fixing a problem and showing how easy hooking is on Linux. Amazing job, guys :mwahaha:

    I noticed this crash twice yesterday after setting up a script that opens a connection at regular intervals to check on the server.

    On CentOS the open-file limit is 1024. It is easy to set a higher limit, but that would not be the right workaround, since the leak would still reach the new limit eventually.
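    To check which limit a process actually runs with, the standard getrlimit call reports it. A small sketch, where the wrapper name get_fd_limits is just an illustration:

```c
#include <sys/resource.h>

/* Fetch the soft and hard limits on open file descriptors for the
   calling process. Returns 0 on success, -1 on failure. */
int get_fd_limits(rlim_t *soft, rlim_t *hard) {
    struct rlimit lim;

    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return -1;
    *soft = lim.rlim_cur;  /* the limit the process hits first */
    *hard = lim.rlim_max;  /* ceiling the soft limit can be raised to */
    return 0;
}
```

    The soft value is the same number that ulimit -n prints in the shell that launches the server. With one descriptor leaked per connection, it is roughly the number of connections the server survives.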

    I'm interested in giving this solution a try.

    Thanks for sharing. This should be fixed ASAP.
     
    Last edited: Dec 18, 2013
  5. wolvern

    wolvern Orbital Explorer

    as a kinda new to linux person who is learning on the fly... i just
    nano starbound.so
    copy that in and then do the rest?

    my server is currently running and sided with a webserver on linux as we speak that i setup last night..... so i know enough linux to set that up and the remote ssh / GUI / running files as applications / adding and removing admins / and stuff...

    kinda learning everything on the fly.....
     
  6. Yes, it's that simple. Note that this will not get rid of 100% of the socket-related crashes. Some will remain, but they will no longer make your system unresponsive by hitting the file limit.
     
  7. wolvern

    wolvern Orbital Explorer

    yeah... what confused me is it says create starbound.c but never starbound.so yet it then runs with starbound.so?

    also the file direction starbound.so > starbound.c

    yet there wasn't a .so to begin with..?
     
  8. 1) Run nano starbound.c and paste in the code from the post above.

    2) gcc -fPIC -shared -o starbound.so starbound.c -ldl will create starbound.so, which you place next to starbound_server.

    3) Create a second launch script, e.g. nano launch_starbound_server_patched.sh

    4) Paste the following inside:

    Code:
    #!/bin/sh
    
    cd "$(dirname "$0")"
    
    LD_PRELOAD=$PWD/starbound.so LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:./ ./starbound_server
    
     
  9. wolvern

    wolvern Orbital Explorer

    haha... you read my mind... was making that just then, just was actually thinking of how to link the directory in the scripts :rofl:

    so thanks again... just made my exploring on google cease... i'm also making a init.d command using the basic tutorial to load this via console with:

    sudo ./starbound start

    and a

    sudo ./starbound stop so it's faster... i think :)


    sooooo the sudo /etc/init.d/starbound start works with your code and my code combined... just tested it.. now to make a stop command...

    and yes it was faster than doing ~/Desktop/starbound/linux64/..... since i use init.d more so and can remember it easier :)
     
    Last edited: Dec 23, 2013
  10. Correct me if I'm wrong: the server no longer creates new sockets (fine), but from time to time it can still crash with 'Bad file descriptor'.

    So I ran a stress test of 5000 rapid connections against the patched server, and 100% of the crashes are

    Code:
    Error: TcpServer will close, listener thread caught exception:  NetworkException: setSockOpt failed to set 6, 1: Bad file descriptor
    It seems the server's main thread is set to shut down on a setsockopt exception, and because the server creates an unbounded number of sockets, it keeps calling this function, each time with a risk of failure and shutdown.

    I think we should hook setsockopt, force the socket keepalive only once, and return 0.

    With the code below and stress-test sessions of 5000 connections: 0 crashes, no disconnects, only lag :)


    Code:
    #define _GNU_SOURCE
    
    #include <sys/socket.h>
    #include <dlfcn.h>
    #include <stdlib.h>
    
    static int reuse = -1;
    static int reuse2 = -1;
    
    int socket(int domain, int type, int protocol) {
        static int (*socket_orig)(int, int, int) = NULL;
        int ret;
    
        if (!socket_orig)
            socket_orig = dlsym(RTLD_NEXT, "socket");
    
        if (type == SOCK_STREAM)
            if (reuse > -1) return reuse;
    
        ret = socket_orig(domain, type, protocol);
    
        if (type == SOCK_STREAM)
            reuse = ret;
    
        return ret;
    }
    
    
    int setsockopt(int s, int level, int optname, const void *optval, socklen_t optlen) { 
        if (reuse2 == 0)
            return 0;
     
        static int (*setsockopt_orig)(int, int, int, const void *, socklen_t) = NULL;
     
        if (!setsockopt_orig)
            setsockopt_orig = dlsym(RTLD_NEXT, "setsockopt");
    
        int opt = 1;
     
        reuse2 = setsockopt_orig(s, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
     
        return 0;
    }
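    The stress test itself is not shown in this post. A rough sketch of that kind of client, meant for testing your own server only, could look like the following. The function names connect_once and stress_connect are invented here, and the address and port are whatever your server listens on.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open one TCP connection to ip:port and close it straight away.
   Returns 0 if the connection was established, -1 otherwise. */
int connect_once(const char *ip, int port) {
    struct sockaddr_in addr;
    int fd, rc;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons((unsigned short) port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) return -1;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    rc = connect(fd, (struct sockaddr *) &addr, sizeof(addr));
    close(fd);  /* the client always releases its own descriptor */
    return rc == 0 ? 0 : -1;
}

/* Make count connections in a row; returns how many succeeded. */
int stress_connect(const char *ip, int port, int count) {
    int ok = 0;
    int i;

    for (i = 0; i < count; i++) {
        if (connect_once(ip, port) == 0) ok++;
    }
    return ok;
}
```

    Calling stress_connect("127.0.0.1", port, 5000) while watching the lsof count from the first post shows whether the server's descriptor count still grows.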
     
    Last edited: Dec 23, 2013
  11. furrycat

    furrycat Aquatic Astronaut

    Good find. Your fix works by ignoring the error code from the real setsockopt and always returning success. Unfortunately it overwrites the options. I propose this change:
    Code:
    #define _GNU_SOURCE
    
    #include <sys/socket.h>
    #include <dlfcn.h>
    #include <errno.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    
    static int reuse = -1;
    
    int socket(int domain, int type, int protocol) {
      static int (*socket_orig)(int, int, int) = NULL;
      int ret;
    
      if (! socket_orig) socket_orig = dlsym(RTLD_NEXT, "socket");
    
      if (type == SOCK_STREAM) {
        if (reuse > -1) return reuse;
      }
    
      ret = socket_orig(domain, type, protocol);
      if (type == SOCK_STREAM) reuse = ret;
    
      return ret;
    }
    
    int setsockopt(int sockfd, int level, int optname, const void *optval, socklen_t optlen) {
      static int (*setsockopt_orig)(int, int, int, const void *, socklen_t) = NULL;
    
      if (! setsockopt_orig) setsockopt_orig = dlsym(RTLD_NEXT, "setsockopt");
    
      if (setsockopt_orig(sockfd, level, optname, optval, optlen)) {
        printf("Warn: Ignoring error: setsockopt(%d, %d, %d, %p, %u): %s\n", sockfd, level, optname, (void *) optval, (unsigned) optlen, strerror(errno));
        errno = 0;
      }
    
      return 0;
    }
     
    Circuitbomb likes this.
  12. //edit: Will also check whether this morning's Starbound update fixed this.

    Thanks, will try. I figured out that I forgot to reuse my reuse2, so it was not really calling it only once. I fixed the code and will try yours too. Thanks for sharing.
     
  13. Looks OK and cleaner than mine.

    By the way, Angry Koala did not fix the issue: sockets are still created endlessly if the server runs without the patch.
     
  14. furrycat

    furrycat Aquatic Astronaut

    I applied the update this morning before I looked at your code. It was still calling socket all the time and when I munged the hack to force an error from setsockopt it would still bomb out. So it looks as though the devs have not yet addressed the problem and the hack is still needed.
     
  15. furrycat

    furrycat Aquatic Astronaut

    Oh you already posted. I suck.
     
  16. Seriallos

    Seriallos Space Penguin Leader

    Also not seeing a fix for my server with Angry. Is there another place to post server crash bugs in a more official way?
     
  17. What crash do you have in the log? I usually post a thread in Billing Help Support with a log of the error inside a code BBCode block so that it is search-friendly on the forums, then I link that thread in a few of the threads that gather issues, like

    GENERAL BUG REPORTS (400 pages)
    CRASH REPORTS: Post 'em here (200 pages)
    Starbound BETA: Bugs Database

    I'm not sure they see everything with all these pages. They would be better off setting up a bug tracker like Planetary Annihilation has. It should be their moderators' job to manage this...
     
  18. wolvern

    wolvern Orbital Explorer

    i post direct images of the logs sometimes if it's small enough, less lines than the logs and more direct on the issue that caused it... but same forums... this update sure did cause a heap more bugs tho than it fixed :/ if it fixed any
     
  19. What I'm not sure about with your code is that it does not force SOL_SOCKET, SO_REUSEADDR. I think I read somewhere that if we reuse a socket we must set it to keep alive, and since the vanilla server does not reuse the socket, I doubt it sets keep alive.

    Will try this alternative later.

    At least the major issue is (fixed), because currently 10 lines of code suffice to crash a vanilla Starbound Linux server. In security terms that is called a DoS exploit, and it is usually fixed faster than that.
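    One way to settle the keep alive question without guessing is to read the options back with getsockopt. SO_REUSEADDR and SO_KEEPALIVE are two separate SOL_SOCKET options, so setting one does not change the other. A small sketch, with wrapper names invented for the example:

```c
#include <sys/socket.h>
#include <unistd.h>

/* Read back an integer-valued SOL_SOCKET option.
   Returns the option value, or -1 if getsockopt fails. */
int get_int_sockopt(int fd, int optname) {
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, SOL_SOCKET, optname, &value, &len) != 0) return -1;
    return value;
}

/* Set an integer-valued SOL_SOCKET option; returns setsockopt's result. */
int set_int_sockopt(int fd, int optname, int value) {
    return setsockopt(fd, SOL_SOCKET, optname, &value, sizeof(value));
}
```

    On a fresh TCP socket, enabling SO_REUSEADDR with set_int_sockopt leaves get_int_sockopt(fd, SO_KEEPALIVE) at 0. Whether the server ever sets keep alive is something only strace or a hooked setsockopt can show.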
     
  20. furrycat

    furrycat Aquatic Astronaut

    setsockopt can be used to set any of a bunch of different options. You can read about them in the socket(7) manpage.

    If you modify the code to print out the parameters every time, eg
    Code:
    printf("setsockopt(%d, %d, %d, %x, %d)\n", sockfd, level, optname, optval, optlen);
    or run strace, you will see that the server calls setsockopt multiple times with different values. Forcing the arguments to SOL_SOCKET, SO_REUSEADDR would prevent that from working properly.

    When testing, I saw the following from the server:
    Code:
    setsockopt(8, 1, 2, ca2e724, 4)
    That's it setting SOL_SOCKET (1), SO_REUSEADDR (2) on the listening socket (8). I know 8 is the listening socket because I checked the value of reuse, but lsof would tell you the same.

    Later I saw:
    Code:
    setsockopt(11, 6, 1, 3a05ccc4, 4)
    setsockopt(11, 1, 20, 3a05ccc0, 16)
    setsockopt(11, 1, 21, 3a05ccc0, 16)
    That's it calling IPPROTO_TCP (6), TCP_NODELAY (1) followed by SOL_SOCKET, SO_RCVTIMEO (20) and SOL_SOCKET, SO_SNDTIMEO (21) on the client's socket (11). In other words setting TCP_NODELAY and send/receive timeouts.

    So the server already sets SO_REUSEADDR and your original code would prevent it from setting the client options later.
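    If you would rather not look the numbers up by hand, a small lookup like this can go next to the printf. It is only a sketch covering the options seen in this thread plus SO_KEEPALIVE, and the function name sockopt_name is made up. It compares against the header constants instead of hard-coding 1, 2, 20 and 21, since those values are platform-specific.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Translate a (level, optname) pair as passed to setsockopt into a
   readable name. Only a handful of options are covered. */
const char *sockopt_name(int level, int optname) {
    if (level == SOL_SOCKET) {
        if (optname == SO_REUSEADDR) return "SOL_SOCKET, SO_REUSEADDR";
        if (optname == SO_KEEPALIVE) return "SOL_SOCKET, SO_KEEPALIVE";
        if (optname == SO_RCVTIMEO)  return "SOL_SOCKET, SO_RCVTIMEO";
        if (optname == SO_SNDTIMEO)  return "SOL_SOCKET, SO_SNDTIMEO";
    } else if (level == IPPROTO_TCP) {
        if (optname == TCP_NODELAY)  return "IPPROTO_TCP, TCP_NODELAY";
    }
    return "unknown";
}
```

    Printing sockopt_name(level, optname) inside the hooked setsockopt turns lines like setsockopt(11, 6, 1, ...) into something readable without changing what gets passed on to the real call.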
     
    class101 likes this.
