Tutorial Namegen Files: How they work

Discussion in 'Starbound Modding' started by Alien@System, Jun 19, 2018.

    If you're an aspiring modder hoping to make a new race, you might want to delve deeper into how Starbound generates names when it needs to spawn an NPC, or when you click the die button next to the name field.

    Level: Intermediate. You should already know the syntax of JSON and how to make the rest of the race. I will also be using somewhat technical language.

    Modified Files
    There are up to three files that you need to modify if you want to adjust how a name is generated.

    The first is the .species file of your race in the \species folder. There, you find the attribute "nameGen", which is set to a list of two values. The first is where males of that species get their names, the second is where the females do. The attribute is formatted as follows:
    Code:
    "nameGen" : [ "/species/FILE1.config:names", "/species/FILE2.config:names" ]
    This tells the interpreter where to look for the name generator, which is in the files FILE1.config and FILE2.config, also within the \species folder. In those files, it then looks for the "names" attribute. You might be able to put the entire namegen into the .species file if you wanted, but it's better to keep things modular as they are.
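    To make the "file path, then colon, then attribute" shape of those entries concrete, here is a minimal Python sketch. The helper name split_asset_ref is made up for this example, and it is only an assumption that the game splits the string this way internally.

```python
def split_asset_ref(ref):
    """Split "/species/FILE1.config:names" into (file path, attribute).

    Hypothetical helper for illustration; not Starbound's own parser.
    """
    path, sep, key = ref.rpartition(":")
    if not sep:
        raise ValueError("expected 'path:attribute', got " + repr(ref))
    return path, key


name_gen = ["/species/FILE1.config:names", "/species/FILE2.config:names"]
male_ref, female_ref = name_gen  # first entry: males, second entry: females
print(split_asset_ref(male_ref))
print(split_asset_ref(female_ref))
```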
    Some species use the same name generator for males and females, by having the same file name given in both entries. You can't make a mod for gendered names for those races without overriding the .species file, which is obviously problematic. In Vanilla, the Avian, Fenerox, Floran, Novakid, Penguin and Shadow races use non-gendered generators. If you make your own race, you can freely choose if you want to have the names gendered or not.

    That second file is therefore the .config file your .species file points to for the name generation. That is the crux of the entire name generation process and the file we'll spend the most time with. Like your .species file, it's located in the \species folder.

    The third kind of file is only needed when you want to use Markov Chains for your name generation. Then, you need a .namesource file in the \names folder. I will get to that later.

    The Namegen file
    The namegen file is usually called RACEnamegen.config when nongendered, or RACEmalenamegen.config and RACEfemalenamegen.config when gendered, with RACE replaced by the name of the race. In either case, it is a file in JSON syntax that contains only one object, looking like this:
    Code:
    {
      "names" : [INTERESTING STUFF HERE]
    }
    The "names" attribute must be set to a list, meaning enclosed in square brackets, otherwise your game will crash when it attempts to make a name, because as discussed above, that list full of INTERESTING STUFF is where the name generation happens.
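    If you want to check a namegen file before starting the game, something like the following Python sketch could catch the "not a list" mistake early. The function name load_names is my own, and Python's json module is stricter than the game's parser may be, so treat this as a rough sanity check and not as what the game does.

```python
import json


def load_names(config_text):
    """Parse a namegen file's text and return its "names" list."""
    config = json.loads(config_text)
    names = config.get("names")
    if not isinstance(names, list):
        raise ValueError('"names" must be a list (square brackets)')
    return names


good = '{ "names" : [ "Aggy", "Biggy" ] }'
bad = '{ "names" : "Aggy" }'  # a bare string instead of a list

print(load_names(good))
try:
    load_names(bad)
except ValueError as err:
    print("rejected:", err)
```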

    That list is passed on to its own interpreter, which goes through the entries in a certain way and gives back a name at the end. To determine which way it's supposed to go, the first entry in that list should be a JSON object looking like this:
    Code:
    { "mode" : MODE }
    where MODE is one of three possible values: "alts", "serie" and "markov". If the first entry of the list is not such an object, the interpreter will default to "alts". Now, here is what those modes mean:

    "mode" : "alts"
    In this mode, the rest of the list entries are treated as alternatives. So the interpreter will randomly pick one of those entries and evaluate that. If it's just a string, then that is our name and we're done. If it's a list, then we recursively run our interpreter over that list again, starting by checking the mode.

    This is most likely the mode you want to use when starting out, because if you just provide a list of names, formatted as strings, the game will randomly pick one from that list. This is for example used by the penguins, who have the following core in their namegen file:
    Code:
    [ "Aggy", "Biggy", "Blaggy", "Bloggy", "Boggy", "Braggy", "Cloggy", "Craggy", "Iggy", "Jaggy", "Jeggy", "Joggy", "Laggy", "Leggy", "Loggy", "Luggy", "Moggy", "Muggy", "Naggy", "Smaggy", "Snaggy", "Swaggy", "Waggy", "Wiggy", "Weggy", "Zaggy", "Ziggy" ]
    This list doesn't have a {"mode":MODE} object, so it defaults to "alts", and thus picks one of those names in the list above.
    Of the Vanilla species, the Humans, Hylotl and Penguins use just a simple namelist like that. So do the Shadow, technically, but their list is literally only one entry, "...". Every shadow person uses that name.
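    The "alts" behaviour described above can be sketched in a few lines of Python. This is my own illustration, not the game's code: pick_alts is a made-up name, it only handles "alts", and it assumes every entry is equally likely to be picked.

```python
import random


def pick_alts(entries, rng):
    """Evaluate a list in "alts" mode: pick one entry, recurse into lists."""
    # A leading {"mode": "alts"} object is optional, since "alts" is the default.
    # Other modes are not handled by this sketch.
    if entries and isinstance(entries[0], dict):
        entries = entries[1:]
    choice = rng.choice(entries)
    if isinstance(choice, list):
        return pick_alts(choice, rng)  # nested list: evaluate it the same way
    return choice  # plain string: this is the name


penguins = ["Aggy", "Biggy", "Blaggy", "Bloggy", "Boggy"]  # shortened list
nested = [{"mode": "alts"}, "Iggy", ["Jaggy", "Jeggy"]]

rng = random.Random(0)  # fixed seed so runs are repeatable
print(pick_alts(penguins, rng))
print(pick_alts(nested, rng))
```

    Note that under the equal-chance assumption, nesting changes the odds: in the nested example, "Iggy" comes up about half the time, while "Jaggy" and "Jeggy" share the other half.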

    "mode" : "serie"
    In this mode, the other entries are evaluated serially. The interpreter starts with the first, evaluates that, then moves on to the second, appends the result of that evaluation to its previous result, and so on, until it reaches the end of the list.

    As an example, here is the core of the Fenerox name generator:
    Code:
    [
    { "mode" : "serie" },
    [ "N", "D", "Ch", "B", "T" ],
    [ "ox", "ex", "ax", "ux" ]
    ]
    Because of "serie", we have to evaluate the entries one after the other. The first is a list, without a "mode" given, so it's "alts", so we pick one entry at random. For example, "Ch". That's a string, so we're done with the first entry, and move on to the second. Again, a list without "mode", so "alts", so pick one randomly, for example "ax". Again, it's a string, so we're done. We now append that to our previous result to get the final name, "Chax".
    In addition to the Fenerox, the Glitch also use the mode in this way, resulting in their characteristic two-part names.
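    To see how many different names such a rule can produce, you can enumerate every combination instead of picking randomly. The Python sketch below does that for rules built only from "alts" and "serie" lists; all_names is a made-up helper, and "markov" entries are deliberately not covered.

```python
from itertools import product


def all_names(entries):
    """Return every name a rule made of "alts" and "serie" lists can produce."""
    mode = "alts"  # default when no mode object leads the list
    if entries and isinstance(entries[0], dict):
        mode = entries[0].get("mode", "alts")
        entries = entries[1:]
    # Each entry contributes a set of possible strings.
    parts = [all_names(e) if isinstance(e, list) else {e} for e in entries]
    if mode == "serie":
        # One result per combination, joined in order.
        return {"".join(combo) for combo in product(*parts)}
    if mode == "alts":
        # Any single entry's result may be chosen.
        return set().union(*parts)
    raise ValueError("mode not covered by this sketch: " + repr(mode))


fenerox = [
    {"mode": "serie"},
    ["N", "D", "Ch", "B", "T"],
    ["ox", "ex", "ax", "ux"],
]
names = all_names(fenerox)
print(len(names))       # 5 * 4 = 20 combinations
print("Chax" in names)  # True
```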

    As an artifact from times when characters had last names, the "serie" mode has a sub-setting, enabled by adding "titleCase": true to the object (note the missing quotation marks around true: this is a boolean value, not a string). In this case, the capitalization of the words in the lists is ignored, and instead everything is lowercase except the first letter and those following a space or similar punctuation. This is mostly useful in conjunction with the "markov" mode, where a uniform capitalization isn't a given.
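    As a rough picture of what "titleCase" does to a finished name, here is a small Python sketch. Which characters count as "similar punctuation" in the game isn't something I can state for certain, so this version simply assumes that any non-letter character starts a new word; title_case is a made-up name.

```python
def title_case(name):
    """Lowercase everything, then capitalize the first letter of each word.

    Assumption: a word starts at the beginning of the string or right after
    any non-letter character (space, hyphen, apostrophe, ...).
    """
    result = []
    start_of_word = True
    for ch in name:
        if ch.isalpha():
            result.append(ch.upper() if start_of_word else ch.lower())
            start_of_word = False
        else:
            result.append(ch)
            start_of_word = True
    return "".join(result)


print(title_case("chAX"))          # Chax
print(title_case("mary ann-lee"))  # Mary Ann-Lee
```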

    "mode" : "markov"
    If that mode is encountered, there shouldn't be any more entries in the list, and if there are, they'll probably be ignored. This is the keyword for the interpreter to hand its duty over to yet another algorithm, a Markov Chain implementation.

    A Markov Chain is a way to generate new words that are stylistically similar to a list of starting words. It does that by assembling the letters in such a way that it mimics the probabilities that those letters follow each other in our library of words we gave it. As an example, here is a sample library:
    Code:
    ["Apex", "Avian", "Floran", "Glitch", "Human", "Hylotl", "Novakid", "Agaran", "Alpaca", "Ancient", "Deadbeat", "Fenerox", "Frogg", "Penguin", "Shadow"]
    First, we randomly pick a first letter from some word in our library, for example "A" from "Agaran". Now we check where that letter appears and what follows it: "Apex", "Avian", "Floran", "Human", "Novakid", "Agaran", "Alpaca", "Ancient", "Deadbeat", "Shadow". Of those following letters, we randomly take one and append it, for example "c", getting us "Ac". We search again and find that "ac" appears only once, in the word "Alpaca". Since that's not enough to know what the probabilities are like, we "forget" our oldest letter and just look for "c". We get a few matches: "Glitch", "Ancient" and "Alpaca". So we might add an "h" to get "Ach". Then maybe we add the "y" from "Hylotl", and so on. We stop when we have enough letters, or when we draw an End of Word "letter", so to speak: a combination like "an" also appears at the end of words, where no letter follows it.
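    The walkthrough above can be turned into a small Python sketch. To be clear, this is a generic toy Markov chain and not Starbound's implementation: the names build_table and generate, the two-letter context, the "at least two matches" threshold and the length cap are all assumptions of mine, and it knows nothing about "prefixSize" or "endSize".

```python
import random
from collections import defaultdict

END = ""  # marker meaning "the word stops here"


def build_table(words, max_order=2):
    """Map each 1..max_order letter context to the letters seen after it."""
    table = defaultdict(list)
    for word in words:
        w = word.lower()
        for i in range(1, len(w) + 1):
            nxt = w[i] if i < len(w) else END
            for order in range(1, max_order + 1):
                if i - order >= 0:
                    table[w[i - order:i]].append(nxt)
    return table


def generate(words, rng, max_len=8, max_order=2, min_matches=2):
    """Grow a name letter by letter from the follower table."""
    table = build_table(words, max_order)
    name = rng.choice(words)[0].lower()  # first letter of a random word
    while len(name) < max_len:
        # Try the longest context first; "forget" the oldest letter if the
        # context is too rare to say anything about probabilities.
        for order in range(min(max_order, len(name)), 0, -1):
            followers = table.get(name[-order:], [])
            if len(followers) >= min_matches or order == 1:
                break
        if not followers:
            break
        nxt = rng.choice(followers)
        if nxt == END:  # drew the End of Word marker
            break
        name += nxt
    return name.capitalize()


library = ["Apex", "Avian", "Floran", "Glitch", "Human", "Hylotl", "Novakid",
           "Agaran", "Alpaca", "Ancient", "Deadbeat", "Fenerox", "Frogg",
           "Penguin", "Shadow"]

rng = random.Random(42)  # fixed seed so runs are repeatable
print([generate(library, rng) for _ in range(5)])
```

    Because every appended letter is drawn from what actually follows the current ending somewhere in the library, any two neighbouring letters in a result also sit next to each other in at least one source word.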

    There are a few tricks and details about Markov Chains that depend on the implementation and aren't needed to understand them, so let us now look in detail at what Starbound does. The object has a few more attributes than just "mode" when used in "markov", and looks like this:
    Code:
    { "mode" : "markov", "source" : "SOURCE", "targetLength" : [3, 7] }
    "source" is our attribute that tells us where to find our Markov library of names. This file is \names\SOURCE.namesource and has this appearance:
    Code:
    {
      "name" : "SOURCE",
      "prefixSize" : 2,
      "endSize" : 2,
      "sourceNames" : [ NAMES
      ]
    }
    SOURCE should be replaced by whatever name you want to use. The vanilla sources are named after the cultures they come from, so for example aztec.namesource is used by Avian (both genders), and russianmale.namesource by Apex males.
    NAMES is a list of string entries, like the example above, and should have as many names as you can find, as the Markov Chain works better the more material it has to work with.
    The two entries "prefixSize" and "endSize" help guide the algorithm by making it ignore certain parts of words when it's not in that part of the word itself. So in our above example, the chain, after having started with "A", would ignore all those "an" found at the end of names, because it's still starting out and doesn't want to end with just "An" and try to pass that off as a valid word.
    If you don't know how many letters usually constitute the beginning and ending of words in the language you chose, just leave both at the default setting of 2, which works out okay.

    In our namegen file, we have the last attribute, "targetLength", which is rather self-explanatory: The list it is set to, having two entries, gives us the minimum and maximum name lengths we'd like to get from our generator. It's possible that the generator enters into a bit of a loop and generates a very long name, like "Banananananananana". To prevent that, we have those options telling it to stop and get to the end of the word already. The values should be set to something sensible in relation to the names in our source file. A good rule of thumb is using the minimum and maximum lengths of names present in that list.
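    The rule of thumb for "targetLength" is easy to compute from your own name list. Here is a small Python sketch; suggested_target_length is a made-up helper, and the sample list is just a few names from the example library above.

```python
def suggested_target_length(source_names):
    """Return [shortest, longest] name length found in the list."""
    lengths = [len(name) for name in source_names]
    return [min(lengths), max(lengths)]


source_names = ["Apex", "Avian", "Floran", "Glitch", "Human", "Hylotl", "Novakid"]
print(suggested_target_length(source_names))  # [4, 7]
```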

    The Markov Chain is used by Apex, Avian and Floran among the Vanilla races, with only the Apex having gendered versions for their generator. All of them have their generator nested within a "serie" to ensure that the capitalization is right, so it might be a good idea if you did that for your race, too:

    Code:
    {
      "names" : [
      { "mode" : "serie", "titleCase" : true },
      [ { "mode" : "markov", "source" : "SOURCE", "targetLength" : [min, max] } ]
      ]
    }
    Chaining modes
    It's important to note that the name generator works recursively, and that you can nest lists within lists to create some rather interesting possibilities, if you want to. Beyond the Fenerox and Glitch as shown above, that is not present in the game at the moment, although it used to be, with people having surnames generated separately from their first names.
    Nobody is stopping you from using that functionality, though. You could for example have two separate "markov" generators, chosen between with an "alts" mode, to give your race two very dissimilar naming traditions. Or you could have a single "Latin" markov source file, and use a "serie" mode to append "us" to male and "a" to female names.
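    Those two ideas could be written down roughly as follows. I build the nested lists in Python and print them as JSON so the brackets stay balanced. The source names "latin", "traditionA" and "traditionB" and all the length values are hypothetical, and I haven't tested these exact rules in the game, so take them as a sketch of the structure only.

```python
import json


def markov(source, target_length):
    """Build a single-entry list holding a "markov" mode object."""
    return [{"mode": "markov", "source": source, "targetLength": target_length}]


# Hypothetical "latin" source: same generator, different suffix per gender.
latin_male = [{"mode": "serie", "titleCase": True}, markov("latin", [3, 6]), ["us"]]
latin_female = [{"mode": "serie", "titleCase": True}, markov("latin", [3, 6]), ["a"]]

# Two hypothetical traditions, one of which is picked by "alts"; the outer
# "serie" with "titleCase" is only there to fix the capitalization.
two_traditions = [
    {"mode": "serie", "titleCase": True},
    [
        {"mode": "alts"},
        markov("traditionA", [3, 7]),
        markov("traditionB", [4, 9]),
    ],
]

for rule in (latin_male, latin_female, two_traditions):
    print(json.dumps({"names": rule}, indent=2))
```

    The suffixes are wrapped in one-entry lists, mirroring the Fenerox example, so that each part of the "serie" is itself a list.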


    And that's basically it. If you have questions or improvement suggestions, feel free to post.
     
