
Server Help Developed Worlds Randomly Wiping and (Sometimes) Returning

Discussion in 'Multiplayer' started by Eberict, Feb 21, 2014.

  1. Eberict

    Eberict Void-Bound Voyager

    Today, Mug and I attempted to recreate the vanishing world problem with a private server test on Unstable. In an earlier test on v5, we were able to corrupt a world with only three players building/digging on it, so we wanted to make a log of that happening again for you. Unfortunately, we had great difficulty connecting to Mug's server (join failed errors) as Mug's ports kept randomly closing rather than opening/forwarding -- a problem that went away when we reverted to v5 and tried connecting again. I understand that this is a different problem, but it has prevented us from testing for this one. I'm hoping the database test zip supplied for the public server owners can supply good intel in our place.
     
  2. ArchGaden

    ArchGaden Big Damn Hero

    We're running a dedicated host and the crash is just starbound. My SSH and SCP connections stay up just fine and the auto-restart kicks in, instantly resetting starbound. The starbound server is up and accepting connections in less than a minute after a crash. I wish I had some insight on the crashes themselves as that's another big prolific problem with the starbound server software... but it's far more tolerable than the world resets. We're hitting 40-50 players on during prime hours and I understand the software really isn't ready for that kind of punishment. We've got some serious hardware behind it though! Were I in your position, I would pester the higher ups to provide resources for an official test server... something you'd have full control over and could experiment on without having to deal with us unreliable player server admins. I'm certain you'd get far more players than you need willing to connect and play around.

    I'll try to get the database test compile up and running today, but it is something I'd have to clear with the owner I expect.
     
  3. Blixt

    Blixt Void-Bound Voyager

    Would someone be willing to upload a few .fail files (preferably recent ones)? I'd like to inspect them using my Starbound file tools to see if they're all corrupted the same way (and to maybe come up with a repair process).
     
  4. ArchGaden

    ArchGaden Big Damn Hero

    I can email you some tonight if you like. Our server generates .fail files by the metric buttload. I wouldn't hold out all that much hope of repairing them. Look back through the thread to see Kyren's comments on .fail files to get some insight.
     
  5. Blixt

    Blixt Void-Bound Voyager

    Well, I did almost completely recover one .fail file from someone else. The thing is that it only seems to be part of the world metadata that is corrupted, not the world data itself. And the world metadata, if it turns out to be completely lost (as opposed to just jumbled around), can be recovered from the new .world file (since presumably the worlds will have the same seed).

    I'll DM you my e-mail. :)
     
  6. ArchGaden

    ArchGaden Big Damn Hero

    I won't be able to get you the files until I get home (about 5 hours or so).

    Kyren was talking about the meta-data corruption earlier. My idea after hearing that was just to let starbound generate new meta-data. Since worlds are generated from a seed, there is seemingly no true randomness involved on the level of a particular world. The meta-data should generate exactly the same each time. It seems you had a similar thought and proved it out then. This plan would have trouble between game versions... if the planet generation changed (which has happened before and presumably will keep happening with new content). Of course, it may not really matter so much if your sky color changes. Losing your atmosphere or experiencing a planet size change would present a challenge... To solve that issue, we could pull meta-data from a previous backup of the planet, but that does require you to have a backup around. It would be nice to be able to repair a world that has undergone significant work instead of simply reloading a backup.

    If you can get .fail repair working well, I'd like to see it automated as a temporary fix to the world reset issue. One possible approach would be to maintain a backup of world meta-data for every world, adding meta-data to this backup only when a world is first generated. Before starting the server, assume all the world meta-data for recently accessed worlds is bad and load the backup meta-data.
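
    The "add to the backup only once" idea above can be sketched in Python. This is a rough illustration, not a repair tool: the thread does not describe how meta-data is laid out inside a world file, so the sketch copies whole `.world` files as a stand-in, and it treats "first generated" as "first time the script sees the file". The function name and the directory arguments are hypothetical.

```python
import os
import shutil

def backup_new_worlds(universe_dir, backup_dir):
    """Copy each .world file that has no backup yet; never overwrite one.

    Assumption: whole-file copies stand in for a meta-data-only backup,
    because the meta-data layout is not described in the thread.
    """
    os.makedirs(backup_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(universe_dir)):
        if not name.endswith(".world"):
            continue
        target = os.path.join(backup_dir, name)
        if os.path.exists(target):
            continue  # keep the earliest copy untouched
        shutil.copy2(os.path.join(universe_dir, name), target)
        copied.append(name)
    return copied
```

    Run before each server start, this would leave the earliest known copy of every world in the backup directory, whatever happens to the live file later.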
     
  7. The Grand Mugwump

    The Grand Mugwump Subatomic Cosmonaut

    My best word for describing it is...bizarre. The server worked fine in v5. Then I opted into unstable, made sure I was at 'Beta v. Enraged Koala - Update 6', protocol 640, and set up the server again. But it didn't acknowledge attempts to join from anyone outside my local network. Ports were going haywire, opening and closing themselves regardless of my forwarding settings. I tried turning off firewalls, turning anti-virus off, turning UPnP on and off (in both the router and the starbound.config), verifying my IPs to make sure things were pointing the right way, resetting the router, verifying the steam cache, and whatever other credible-seeming tutorials I could find on the internet for such a problem. And then when I rolled back to v5, it worked flawlessly again. I still have no clue what was going on there and no clue how to get clients outside my local network to join my unstable v6 server.

    Also as a warning to anyone else who tests this, any of your characters who log into unstable v6 will lose their ship world if you opt back to v5.
     
  8. kyren

    kyren kyren

    I have a new theory that I would like to mention.

    I have been so far (almost) completely unsuccessful in reproducing the corruption issues that you all are experiencing. At first I thought that this must be due to environment differences, or very deep rare bugs that you have just the right environment to exploit, or something specific to all the environments I test in, but it seems that you guys experience these issues with *frightening regularity*.

    Now, why did I just say *almost* unsuccessful.. Well, I have been able to easily and repeatedly get the exact same errors that are reported to me about db corruption in one specific way... by running multiple processes that access databases concurrently.

    Currently there is NO checking done to make sure that lingering processes are not accessing files when a new starbound_server process goes live. There are no lockfiles for universe or player directories, and the behavior for two starbound processes trying to read a single database file at a time is immediate corruption.

    In fact, that database_test program I posted.. I can run it and kill -9 it hundreds of times with no errors on 3 operating systems, but if I happen to run it twice *at the same time* I get this:

    http://sprunge.us/gUJM

    Which looks rather familiar! Now, I'm not saying that this is the only avenue for this or trying to minimize the problems you're seeing, but you have to understand that I know you guys see these problems all the time, but I'm having a *damn hard* time reproducing it and I'm starting to go a little loopy.

    Could you maybe add some process checks to your scripts that run the starbound server, or look for any evidence that multiple starbound_server instances are being started at the same time?
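
    For server admins who want a stopgap before an official lockfile exists, a wrapper script can hold its own lock on the universe directory. The Python sketch below is an assumption-laden illustration, not how Starbound does or will do it: it relies on Unix advisory locks (`fcntl.flock`), and the lock file name `starbound.lock` and the function name are made up for the example. It only protects against other launchers that use the same wrapper.

```python
import fcntl
import os

def acquire_lock(directory, name="starbound.lock"):
    """Return an open file holding an exclusive lock, or None if it is taken.

    The lock is advisory and is released when the returned file is closed
    or when the process holding it exits.
    """
    path = os.path.join(directory, name)
    handle = open(path, "a+")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle
```

    A launcher would call `acquire_lock(universe_dir)` first, refuse to start the server if it gets `None`, and keep the returned file open for as long as the server runs.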
     
  9. Grephus

    Grephus Void-Bound Voyager

    Total shot in the dark here, but what if you hosted a server for a while and had a few people from the community connect and try to break it? I know that environment differences could be at play here, but if we can break it on your end somehow at least you'd have that information readily available (and maybe eliminate a few other theories in the process, narrowing our options).
     
    Last edited: Mar 1, 2014
  10. kyren

    kyren kyren

    That's not really what I'm looking for. I'm looking for some way of reproducing it that I can *repeat* so I can inspect it. I can run servers, I have run servers, and I haven't seen this specific flavor of issue. But I need to be able to make it happen while I'm watching.

    Edit: Let me phrase that better, I need to be able to make it happen repeatedly and easily so I can make changes, trace through code, run it in a debugger, etc. I mean, I could run servers and have lots of people connect to them, it's just that the logs I've gotten so far seem to indicate that nothing really interesting is being printed, it just seems to.. corrupt. I dunno, your idea isn't bad, it's just that it would be so much easier if somebody could find a way to break that database_test file and then pass me a script that breaks it, so we can go back and forth trying to figure out what's different between my environment and theirs.
     
    Last edited: Mar 1, 2014
  11. Consumer of Souls

    Consumer of Souls Big Damn Hero

    What if you created an Avian character, beamed to a planet, took person2 into your party, had him fly his ship somewhere, and beamed to his ship during the jump? It seems that these corruptions occur with specific races only.
     
  12. Grephus

    Grephus Void-Bound Voyager

    Sorry, maybe it's because I'm running on about two hours of sleep or something that I didn't explain myself better, it's all good. Sounded better in my head. ;)
     
  13. kyren

    kyren kyren

    No, it's fine, it's not a bad idea at all. I'm just worried that if I get it happening in a server environment when I'm not watching and I get a bunch of log files.. well, I have that already and they don't make sense! I dunno, if the lockfile solution doesn't make it better then I guess the only thing I can do is a wider test. I'm just frustrated :/
     
  14. Blixt

    Blixt Void-Bound Voyager

    Kyren, how does this look for repro? I ran the following command first in one terminal, then again in another (without shutting down the first):

    Code:
    Starbound.app/Contents/MacOS/starbound_server -worldcoordinate alpha:82941833:23271247:-24144880:9:10
    The first terminal then spat this out and shut down (I now have a .fail file for that world):

    Code:
    Error: UniverseServer: Could not load world db for world alpha:82941833:23271247:-24144880:9:10, removing! Cause: DBException: Error, incorrect index block signature.
    0   starbound_server                    0x0000000109f7f945 _ZN4Star13StarExceptionC2EPKc + 277
    1   starbound_server                    0x0000000109f2bfbc _ZN4Star13BTreeDatabaseINS_9ByteArrayES1_E9BTreeImpl9loadIndexEj + 1260
    2   starbound_server                    0x0000000109f404a0 _ZN4Star10BTreeMixinINS_13BTreeDatabaseINS_9ByteArrayES2_E9BTreeImplEE4findERKNSt3__110shared_ptrINS_16SimpleBTreeIndexIS2_jEEEERKS2_ + 80
    3   starbound_server                    0x0000000109f402da _ZN4Star10BTreeMixinINS_13BTreeDatabaseINS_9ByteArrayES2_E9BTreeImplEE4findERKS2_ + 106
    4   starbound_server                    0x0000000109f28c11 _ZN4Star14SimpleDatabase4findERKNS_9ByteArrayE + 81
    5   starbound_server                    0x0000000109db487f _ZN4Star12WorldStorageC2ERKNSt3__110shared_ptrINS_8IODeviceEEERKNS2_INS_20WorldGeneratorFacadeEEE + 495
    6   starbound_server                    0x0000000109d7088c _ZN4Star11WorldServerC2ERKNSt3__110shared_ptrINS_8IODeviceEEENS2_INS_5ClockEEE + 1340
    7   starbound_server                    0x0000000109d18e2c _ZNSt3__110shared_ptrIN4Star11WorldServerEE11make_sharedIJNS0_INS1_4FileEEERNS0_INS1_5ClockEEEEEES3_DpOT_ + 156
    8   starbound_server                    0x0000000109d054d5 _ZN4Star14UniverseServer11createWorldERKNS_19CelestialCoordinateE + 533
    9   starbound_server                    0x0000000109d040e1 _ZN4Star14UniverseServerC2ERKNS_6StringEbRKNS_19CelestialCoordinateE + 3185
    10  starbound_server                    0x000000010989de11 main + 1345
    11  libdyld.dylib                       0x00007fff902245fd start + 1
    12  ???                                 0x0000000000000003 0x0 + 3
    
    Info: UniverseServer: Creating world alpha:82941833:23271247:-24144880:9:10
    Info: Shutting down Star::Root
    Error: Fatal Exception Caught: DBException: Error, incorrect index block signature.
    0   starbound_server                    0x0000000109f7f945 _ZN4Star13StarExceptionC2EPKc + 277
    1   starbound_server                    0x0000000109f2bfbc _ZN4Star13BTreeDatabaseINS_9ByteArrayES1_E9BTreeImpl9loadIndexEj + 1260
    2   starbound_server                    0x0000000109f404a0 _ZN4Star10BTreeMixinINS_13BTreeDatabaseINS_9ByteArrayES2_E9BTreeImplEE4findERKNSt3__110shared_ptrINS_16SimpleBTreeIndexIS2_jEEEERKS2_ + 80
    3   starbound_server                    0x0000000109f402da _ZN4Star10BTreeMixinINS_13BTreeDatabaseINS_9ByteArrayES2_E9BTreeImplEE4findERKS2_ + 106
    4   starbound_server                    0x0000000109f28c11 _ZN4Star14SimpleDatabase4findERKNS_9ByteArrayE + 81
    5   starbound_server                    0x0000000109f29ab0 _ZN4Star20SimpleSha256Database4findERKNS_9ByteArrayE + 48
    6   starbound_server                    0x00000001099255b3 _ZN4Star23CelestialMasterDatabase8getChunkERKNS_22CelestialChunkLocationE + 275
    7   starbound_server                    0x0000000109925e0d _ZN4Star23CelestialMasterDatabase15coordinateValidERKNS_19CelestialCoordinateE + 157
    8   starbound_server                    0x0000000109926b96 _ZN4Star23CelestialMasterDatabase10parametersERKNS_19CelestialCoordinateE + 70
    9   starbound_server                    0x0000000109d05673 _ZN4Star14UniverseServer11createWorldERKNS_19CelestialCoordinateE + 947
    10  starbound_server                    0x0000000109d040e1 _ZN4Star14UniverseServerC2ERKNS_6StringEbRKNS_19CelestialCoordinateE + 3185
    11  starbound_server                    0x000000010989de11 main + 1345
    12  libdyld.dylib                       0x00007fff902245fd start + 1
    13  ???                                 0x0000000000000003 0x0 + 3
    
     
  15. kyren

    kyren kyren

    Yeah, that definitely looks like the errors everyone else is seeing. I mean, if it is that, GREAT because that means that it's not an *actual btree bug* and it's just more like.. it needs protection from accidentally running multiple copies on the same directory.

    I'm going to have a LockFile implementation finished here pretty soon, trying to get the energy to actually finish it.
     
  16. ArchGaden

    ArchGaden Big Damn Hero

    Sadly, I haven't seen anything that leads me to believe we'll get an easy reproduction you can do at the desk. As a software engineer myself, I know the pain of needing solid reproduction steps. This seems to be an issue that only shows up frequently on large servers. There could be other variables at play... like players running bad mods and sending up bad data to the server...the kind of stuff you only see in the wild.

    I'm willing to bet if you ran a test server opened to the public and asked on the forums for players, you'd get more players joining than you know what to do with. Cap the server at 50 and enjoy being able to reproduce the bug practically on the hour. It takes us very little time to reproduce the issue during prime hours when we hit near 50 players online. I know it's actually a rather large ordeal to organize that, and you'd probably want to stick some less utilized staff on it or hire someone for it, but a test bed like that will probably be required to solve a lot of the server issues.

    We'd be willing to run test versions of the server as long as the players don't have to update and then get results back to you, but you're dealing with us unpaid fans on different schedules... we're willing to help, but we're unreliable!

    As for our server configuration, we're running:

    Linux version 3.2.0-57-generic (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5 ) #87-Ubuntu SMP

    I can get any other information you need.
     
  17. ArchGaden

    ArchGaden Big Damn Hero

    I'd be curious how this kind of thing can happen in the wild. We're running an upstart script, so the starbound server process should not be started until it actually stops (due to crash or whatever). There shouldn't be a way for two starbound servers to run at once. If it is some part of the starbound server lingering after a crash and then causing world corruption on the next load, then there is probably a way we can deal with that in scripts as a temporary fix.
     
  18. kyren

    kyren kyren

    Well, it should be easy, because pretty soon I'll have a locking implementation in unstable and then it should be obvious when something goes wrong. In the meantime, I saw that you had your script that rotates log files when you sent me that log file collection the other day. I dunno if it was right because, you know, copying files around, zips, etc., but the timestamps for I think 4 of the logfiles were from the same minute. Check your init scripts, and maybe like.. baby starbound_server a bit more and do "pgrep starbound_server" and other things to just.. make sure it's not running.
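
    A start script could make that check before every launch. The Python sketch below assumes `pgrep` is installed; the helper names are hypothetical. It passes `-f` (match against the full command line) because on Linux the process name that plain `pgrep` compares against is cut off at 15 characters, and `starbound_server` is 16 characters long, so the plain form may report nothing even when a server is running.

```python
import subprocess

def parse_pgrep_output(text):
    """Turn pgrep's one-PID-per-line output into a list of ints."""
    return [int(token) for token in text.split()]

def find_server_pids(pattern="starbound_server"):
    """Return PIDs whose command line matches pattern (empty list if none).

    Caveat: -f also matches any wrapper whose own command line contains
    the pattern, so a launcher should not pass it as an argument to itself.
    """
    # pgrep prints nothing and exits with status 1 when there is no match.
    result = subprocess.run(["pgrep", "-f", pattern],
                            capture_output=True, text=True)
    return parse_pgrep_output(result.stdout)

def safe_to_start(pids):
    """Only start a new server when no earlier instance is still alive."""
    return len(pids) == 0
```

    A launcher would call `safe_to_start(find_server_pids())` and wait, or refuse to start, while it returns `False`.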
     
  19. Grephus

    Grephus Void-Bound Voyager

    Even at that hour I still didn't explain myself better, lol, sleep deprivation for the win I suppose? I didn't mean 'just run a server and let us play on it while you do something else', you'd have to literally sit there and babysit the thing until it inevitably crashes, which unfortunately can take two to four hours (from what I've witnessed).

    Maybe it's something to try after you've implemented lockfile?
     
  20. ArchGaden

    ArchGaden Big Damn Hero

    The timestamps for the 4 _prevx log files were the same because it's just a rotating copy. prev4 overwrites prev5, prev3 overwrites prev4, and right on down to the base log being copied to _prev. I should probably add an argument to preserve timestamps, but it was quick and dirty. In addition to moving logs around, we also force restore certain worlds when the server starts to prevent the worlds from being reset, but I turned that off to collect data on the reset.
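
    A rotation like the one described, but keeping each file's original modification time, might look like the Python sketch below. It is not the actual server script: the `_prev`, `_prev2`, ... naming and the function name are guesses based on the description above. `shutil.copy2` copies the file's timestamps along with its contents, which is what keeps the rotated logs from all showing the same minute.

```python
import os
import shutil

def rotate_logs(base, keep=5):
    """Shift base -> base_prev -> base_prev2 ... up to base_prev<keep>.

    Assumption: rotated files are named <base>_prev, <base>_prev2, and so on.
    """
    def rotated(i):
        return base + "_prev" + ("" if i == 1 else str(i))

    # Oldest first, so each copy overwrites a file that was already shifted.
    for i in range(keep, 1, -1):
        src = rotated(i - 1)
        if os.path.exists(src):
            shutil.copy2(src, rotated(i))  # copy2 preserves the mtime
    if os.path.exists(base):
        shutil.copy2(base, rotated(1))
```

    With timestamps preserved, logs that really were written in the same minute would stand out as a possible sign of overlapping server starts.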
     
