
Self-segregating morphology and Word length

A comparison of different self-segregating morphology (SSM) methods in an IAL context

by Thomas Heller

Contents

  1. What is SSM?
  2. Why is SSM relevant?
  3. How does SSM affect word length?
  4. A note on self-termination and syntax
  5. Example phonetic inventory
  6. Methods for SSM
  7. A note on inventory size
  8. Finding the optimal solutions
  9. Comparing SSM methods and word length
  10. Formulas and parameters for each method
  11. A note on alternatives to SSM
  12. Conclusion

What is SSM?

A language has a self-segregating morphology if it is engineered in such a way that any sequence of phonemes (and, if it has phonemic orthography, also any sequence of letters) can always be broken down into individual morphemes and/or words unambiguously.

Why is SSM relevant?

SSM has a number of theoretical advantages:

Because ease of learning and clarity are usually considered desirable properties of auxlangs and loglangs respectively, SSM is usually discussed in the context of such languages, and less often with regard to artlangs.

How does SSM affect word length?

Interestingly, another desirable property for auxlangs is a small phonetic inventory, to make them easy to pronounce for most humans on Earth. Both SSM and a small phonetic inventory reduce the number of possible syllables and syllable combinations. If the SSM method and phonetic inventory are badly chosen, the language will allow only a few short words, and words will tend to get quite long quickly.

Therefore, in this article, we will explore SSM methods that maximize the number of short words. To clarify: short words, in this context, are words with about 1-4 sounds that would typically be used for pronouns, conjunctions, syntactic particles etc., and very common noun/verb concepts. In contrast, medium words have about 5-6 sounds, and, for the purpose of this discussion, we consider words with 7, 8, and more sounds to be long words, used for more specific concepts, i.e. more technical terms.

A note on self-termination and syntax

As they will appear later in the discussion, I would like to mention three more concepts before I continue. I'm not aware of more common terms for these concepts in the context of constructed languages, so I'm proposing the following definitions:

Example phonetic inventory

It is generally assumed that a phonetic inventory consists of two classes of sounds, consonants (C) and vowels (V). The number of consonants is usually assumed to be greater than the number of vowels. For the sake of this discussion, we will be using the following example phonetic inventory:

Formally, this inventory can be described as C=12, V=5. Going by WALS' chapter 1 and chapter 2, the consonant inventory is small and the vowel inventory is average, respectively. After explaining the SSM methods discussed in this article, I will also go into a bit more detail about how inventory size affects the number of possible words.

Methods for SSM

The following methods for implementing SSM will be discussed in this article:

  1. Pure vowel boundaries (VB)
    All words begin and end with a vowel (V…V – the shortest possible pattern is VCV, but it is possible to allow single-vowel words as well). Word boundaries are detected by looking for consecutive vowels. Diphthongs are not allowed, but mid-word consonant clusters are possible. Due to its simplicity, this method is probably the most "user-friendly" for human learners.
  2. Pure consonant boundaries (CB)
    All words begin and end with a consonant (C…C – the shortest possible pattern is CVC, as single consonants are hard to pronounce for most speakers). Word boundaries are detected by looking for consonant clusters. Consonant clusters are not allowed mid-word, but mid-word diphthongs are possible. This method is similarly "user-friendly", but learners may need to put a bit of extra effort into always articulating final consonants clearly.
  3. Two sets of syllables: initial marking (IM)
    Permissible syllables (CV) are divided into two groups, a and b. A word begins with exactly one syllable from group a; then any number of syllables from group b may follow (including zero). As a special case, to make them easier to distinguish, the syllables can be divided into groups in such a way that the groups differ only in the consonant part (onset) or only in the vowel part (nucleus). (For example, all syllables containing "a", "e", or "i" occur only at the beginning of a word, and all syllables containing "o" or "u" occur only in the middle or at the end of a word.) This method works particularly well for languages that rely mainly on prefixes for inflection.
  4. Two sets of syllables: terminal marking (TM)
    Permissible syllables (CV) are divided into two groups, a and b. A word begins with zero or more syllables from group a; then exactly one syllable from group b must follow to end the word. Again, as a special case, the syllables can be divided into groups in such a way that the groups differ only in the consonant part (onset) or only in the vowel part (nucleus). (For example, all syllables containing "o" or "u" occur only at the beginning or in the middle of a word, and all syllables containing "a", "e", or "i" occur only at the end of a word.) This method works particularly well for languages that rely mainly on suffixes for inflection. The advantage of terminal marking over initial marking is that terminal marking results in a self-terminating morphology, which in turn allows for self-terminating syntax, as described above.
  5. Word-length head marking (HM)
    The initial syllable (the "head") indicates to the listener how long the current word is. After the special initial syllable, any kind of syllable can follow, but the number of syllables in the word must match the amount specified by the special head syllable. This method is probably more oriented towards computer processing, rather than actual human speakers. (The Internet protocols IPv4 and IPv6 use a similar method of indicating "payload length" in a header data field, for example.) The advantage of word-length head marking is that it is self-terminating as well. In this article, we'll be discussing three subtypes of word-length head marking which I'll explain later.
  6. Global termination (GT)
    Another relatively simple concept: Sounds are divided into two groups, a and b. Sounds of group a can occur everywhere, but words must end with a sound from group b. This method is also self-terminating, and easier to memorize than terminal marking. (It would also be possible to build a self-segregating morphology method that marks the beginning of each word with special sounds instead. This, however, would be symmetrical to Global termination and have the same statistical properties, so I'm leaving it out for brevity. It is also less interesting because it would not be self-terminating.)
  7. Glue sounds (GS)
    This is kind of like the opposite of Global termination (GT), where a word continues until it is explicitly terminated. With Glue sounds (GS), special marking is required to indicate that a word continues: From the phonetic inventory, a few sounds are selected as "glue" sounds that bind syllables of the same word together. If there is no "glue" between two syllables, they break apart into separate words. This method is not self-terminating, because we need to wait to hear if another glue sound follows or not.

The VB and CB methods always result in an odd number of sounds per word, and the IM and TM methods implemented with CV syllables result in an even number of sounds per word. With the other methods, it depends on the exact implementation, as described below.
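To make the boundary rules concrete, here is a minimal Python sketch of how a listener (or a program) could segment an unbroken stream of sounds for two of the methods above. It assumes a five-vowel inventory "a e i o u" with one letter per sound, strictly alternating CV syllables for TM, and the example split from the TM description (syllables with "a", "e", or "i" end a word). The function names are hypothetical and only serve as an illustration.

```python
VOWELS = set("aeiou")  # assumed five-vowel inventory, one letter per sound


def segment_vb(stream: str) -> list[str]:
    """Pure vowel boundaries: a word boundary lies between any two adjacent vowels."""
    words, start = [], 0
    for i in range(1, len(stream)):
        if stream[i - 1] in VOWELS and stream[i] in VOWELS:
            words.append(stream[start:i])
            start = i
    words.append(stream[start:])
    return words


FINAL_VOWELS = set("aei")  # assumed split: syllables with a/e/i form group b


def segment_tm(stream: str) -> list[str]:
    """Terminal marking over CV syllables: a word ends at the first group-b syllable."""
    words, start = [], 0
    for i in range(0, len(stream) - 1, 2):
        vowel = stream[i + 1]
        if vowel in FINAL_VOWELS:
            words.append(stream[start:i + 2])
            start = i + 2
    if start != len(stream):
        raise ValueError("stream ends inside an unterminated word")
    return words


print(segment_vb("aalitouminaenadoisailamo"))
print(segment_tm("gomudasagodona"))
```

Note that the TM segmenter can emit a word as soon as its last syllable is heard, while the VB segmenter has to see the first sound of the next word before it knows that the previous one has ended.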

This list of seven basic self-segregating morphology methods is not exhaustive, but intended to cover the most common concepts. It is definitely possible to invent new methods for self-segregating morphology, and existing methods can also be combined. Because IALs usually demand a simple syllable structure like CV, more complex patterns are not really discussed in this article. On the other hand, some methods are simply variations of the methods described here, and have the same statistical properties, so it is safe to omit them from the discussion.

A note on inventory size

In this article we're looking at the efficiency of different SSM methods using a sample phonetic inventory defined above. You might be wondering, however, how inventory size affects the number of possible words. Here are a few basic observations:

For an IAL, I would recommend keeping the inventory small and simple, but for other types of projects, you can use this as a guideline to tweak the inventory if you intend to use a given SSM method. If you are further interested in the relationship between inventory size and number of available words, you can also go to the self-segregating morphology calculator and select the "Compare C/V ratio" option to see a visual representation.

Finding the optimal solutions

Pure vowel boundaries (VB)

"A alito umina enado isa ilamo!" (An example of what a VB language may look like.)

The efficiency of the VB method is already determined by choosing a phonetic inventory. It does not have any parameters that could be changed. In our example, we have 12 consonants and 5 vowels to work with. Just to showcase the possibility, we will allow single-vowel words, although the statistical impact is small (±5 words).

Pure consonant boundaries (CB)

"Rik delem kanat romun, sit denin gumalon…" (What a CB language may look like.)

Likewise, the efficiency of the CB method is already determined by the phonetic inventory. It does not have any parameters that could be changed. Again, in our example, we have 12 consonants and 5 vowels, and we don't allow single-consonant words for ease of pronunciation for most speakers.
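Since neither boundary method has a free parameter, the word counts follow directly from the inventory. The sketch below assumes strictly alternating sounds (no mid-word clusters or diphthongs), which is the simplification the formulas later in this article also make. The function names are illustrative only.

```python
C, V = 12, 5  # example inventory used in this article


def vb_count(length: int, c: int = C, v: int = V) -> int:
    """Number of V(CV)* words of the given length (odd lengths only)."""
    if length < 1 or length % 2 == 0:
        return 0
    return v * (c * v) ** ((length - 1) // 2)


def cb_count(length: int, c: int = C, v: int = V) -> int:
    """Number of C(VC)+ words of the given length (odd lengths from 3 upwards)."""
    if length < 3 or length % 2 == 0:
        return 0
    return c * (v * c) ** ((length - 1) // 2)


for n in (1, 3, 5):
    print(n, vb_count(n), cb_count(n))
```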

Initial marking (IM)

"Mateni ramuno la dageki" (What an IM language may look like.)

The IM method, however, can be optimized along a single axis. Syllables are divided into two groups, and we can choose the number of syllables in the first group (which will determine the number of syllables in the second group).

In our example, we have 12×5 = 60 syllables following a CV pattern. Of these 60 syllables, between 1 and 59 (both inclusive) can be put in the first group, resulting in 59 to 1 (inclusive) complementary syllables in the second group.

The following table lists 9 selected examples of the 59 possible choices, highlighting the best possible values for each word length:

| Set a cardinality | 1 | 2… | 15… | 20… | 30… | 40… | 45… | 58 | 59 |
|---|---|---|---|---|---|---|---|---|---|
| Set b cardinality | 59 | 58 | 45 | 40 | 30 | 20 | 15 | 2 | 1 |
| Words with length = 2 | 1 | 2 | 15 | 20 | 30 | 40 | 45 | 58 | 59 |
| Words with length = 4 | 59 | 116 | 675 | 800 | 900 | 800 | 675 | 116 | 59 |
| Words with length = 6 | 3481 | 6728 | 30375 | 32000 | 27000 | 16000 | 10125 | 232 | 59 |
| Words with length = 8 | 205379 | 390224 | 1366875 | 1280000 | 810000 | 320000 | 151875 | 464 | 59 |
| Sum of length 2 + 4 | 60 | 118 | 690 | 820 | 930 | 840 | 720 | 174 | 118 |

The general pattern is that for short words, it is better to have more syllables in the initial set a, and for longer words it is better to have more syllables in the following set b. A special case is having only one syllable in set b, which would result in a funny lexicon like for example "ma", "maba", "mababa" … "te", "teba", "tebaba" etc. where a single syllable, for example "ba", would repeat in all words longer than one syllable.

It becomes clear that over-optimizing the shortest words is not the most efficient solution overall as that makes it hard to form medium-sized words. On the other hand, there is no need to optimize for long words beyond a certain threshold. The majority of words in a language are seldom used anyway. For example, about 1000 words of the English language already make up about 75% of regularly used words (source). For practical purposes, as soon as a few thousand concepts can be expressed with reasonably short to medium sized words, a morphology is probably sufficiently optimized.

In conclusion, for our calculation we will pick an initial set a size of 30, which is the optimal solution when considering words of length 2 and 4 together, resulting in 930 possible short words. If your language requires a particularly high number of single-syllable words, it would be perfectly reasonable to choose an initial set a size of about 40 or even 45, too, which still yields 840 or 720 words of length 2 and 4 combined, respectively.
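The search for the best split can be sketched in a few lines of Python. This is only an illustration under the assumptions used above (60 CV syllables, exactly one set-a syllable followed by any number of set-b syllables); the names `im_count` and `short_words` are made up for this example.

```python
SYLLABLES = 12 * 5  # 60 CV syllables in the example inventory


def im_count(a: int, n_syllables: int, total: int = SYLLABLES) -> int:
    """Initial-marking words with n syllables: one from set a, the rest from set b."""
    b = total - a
    return a * b ** (n_syllables - 1)


def short_words(a: int) -> int:
    """Words of length 2 and 4 sounds (one and two syllables) combined."""
    return im_count(a, 1) + im_count(a, 2)


best = max(range(1, SYLLABLES), key=short_words)
print(best, short_words(best))  # prints: 30 930
```

A set a size of 31 reaches the same total of 930, because the sum simplifies to a × (61 − a); `max` simply reports the first of the two tied values.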

Terminal marking (TM)

"Gomuda relika sa tena godona." (What a TM language may look like.)

The TM method is optimized exactly the same way as the IM method: Along a single axis, we can put 1 to 59 of the 60 CV syllables in the first group, and the others in the second group.

The table for terminal marking is symmetrical to the table for initial marking, so it is shown here just for completeness' sake:

| Set a cardinality | 1 | 2… | 15… | 20… | 30… | 40… | 45… | 58 | 59 |
|---|---|---|---|---|---|---|---|---|---|
| Set b cardinality | 59 | 58 | 45 | 40 | 30 | 20 | 15 | 2 | 1 |
| Words with length = 2 | 59 | 58 | 45 | 40 | 30 | 20 | 15 | 2 | 1 |
| Words with length = 4 | 59 | 116 | 675 | 800 | 900 | 800 | 675 | 116 | 59 |
| Words with length = 6 | 59 | 232 | 10125 | 16000 | 27000 | 32000 | 30375 | 6728 | 3481 |
| Words with length = 8 | 59 | 464 | 151875 | 320000 | 810000 | 1280000 | 1366875 | 390224 | 205379 |
| Sum of length 2 + 4 | 118 | 174 | 720 | 840 | 930 | 820 | 690 | 118 | 60 |

Using the same optimization strategy as for initial marking above, we will get 930 possible short words for terminal marking as well.

Word-length head marking (HM)

There are several ways in which word-length head marking can be implemented. One main difference is that some methods allow for fixed word lengths up to a specific maximum, while other methods allow for theoretically infinitely long words. In this article, I'm going to focus on three methods that support finite word lengths, because we're mainly interested in short words.

Pragmatic head marking (PHM)

"Filo ne fire, go fadure fanedi." (What a PHM language may look like, when including f.)

What I'm calling Pragmatic head marking here is an approach that takes advantage of the fact that really long words are usually not required anyway – if needed, more complex concepts can be expressed by compounding.

Therefore, of the 60 CV syllables available with our example inventory, we'll be using only 3 syllables to mark words of reasonable lengths, namely 4, 6, and 8 sounds. The remaining 57 syllables will all be used for short, single-syllable words. This way, we can use pragmatic head marking for languages where we would like to have as many single-syllable words as possible.

In theory, pragmatic head marking could be optimized for the maximum number of words with length 2 and 4, like IM / TM, by using fewer syllables for monosyllabic words and distributing the remaining syllables evenly among the word lengths longer than one syllable that we would like to mark.

However, as the following table shows, this path of optimization would yield 1143 words, which is not significantly more than the 930 words provided by an optimal IM / TM, and would lead to a quirky language with only 3 monosyllabic words. I'm not sure if there is a real use case for this, so we'll pick the interesting solution with 57 monosyllabic words for now.

| Mono syllables | 57… | 54… | 45… | 40… | 30… | 20… | 15… | 6… | 3 |
|---|---|---|---|---|---|---|---|---|---|
| Word-length syllables | 3 | 6 | 15 | 20 | 30 | 40 | 45 | 54 | 57 |
| WL syllables per length | 1 | 2 | 5 | 6 | 10 | 13 | 15 | 18 | 19 |
| Words with length = 2 | 57 | 54 | 45 | 40 | 30 | 20 | 15 | 6 | 3 |
| Words with length = 4 | 60 | 120 | 300 | 360 | 600 | 780 | 900 | 1080 | 1140 |
| Words with length = 6 | 3600 | 7200 | 18000 | 21600 | 36000 | 46800 | 54000 | 64800 | 68400 |
| Words with length = 8 | 216000 | 432000 | 1080000 | 1296000 | 2160000 | 2808000 | 3240000 | 3888000 | 4104000 |
| Sum of length 2 + 4 | 117 | 174 | 345 | 400 | 630 | 800 | 915 | 1086 | 1143 |
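As an illustration of how pragmatic head marking could be decoded, the following Python sketch reads the head syllable of each word and then consumes exactly as many sounds as the head announces. The mapping of "fi", "fa", and "fu" to word lengths 4, 6, and 8 is a hypothetical assignment chosen to fit the example sentence above, not a fixed part of the method. The helper `phm_count` reproduces the table rows under the same equal-distribution assumption.

```python
# Hypothetical head syllables: head -> total number of sounds in the word.
HEAD_LENGTHS = {"fi": 4, "fa": 6, "fu": 8}


def segment_phm(stream: str) -> list[str]:
    """Split a stream of CV syllables; any non-head syllable is a word by itself."""
    words, pos = [], 0
    while pos < len(stream):
        head = stream[pos:pos + 2]
        length = HEAD_LENGTHS.get(head, 2)
        if pos + length > len(stream):
            raise ValueError("stream ends inside a word")
        words.append(stream[pos:pos + length])
        pos += length
    return words


def phm_count(mono: int, n_syllables: int, total: int = 60, marked_lengths: int = 3) -> int:
    """Words with n syllables when the non-mono syllables are split evenly as heads."""
    if n_syllables == 1:
        return mono
    heads_per_length = (total - mono) // marked_lengths
    return heads_per_length * total ** (n_syllables - 1)


print(segment_phm("filonefiregofadurefanedi"))
print(phm_count(57, 2), phm_count(3, 2))
```

Because the length is known from the first syllable, the decoder never needs to look ahead, which is what makes this method self-terminating.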

Weighted head marking (WHM)

A way to make word-length head marking more efficient is by weighting the number of word-length indicating syllables by word length. In the table above, you could see in the "WL syllables per length" row that with the equal distribution of remaining syllables among word lengths 4, 6, and 8, we were always using exactly one third of the syllables for each remaining word length.

So, I'm proposing a radical approach here – to use just one syllable for lengths 6 and 8 (as with the PHM method chosen above), as this allows for generating plenty of long words already anyway, and parametrizing only the number of syllables used to mark lengths 2 and 4. (If desired, a few more syllables could be used for word length 6 instead of 2 or 4, of course.)

As the following table shows, with this method, it doesn't make sense to maximize the words of length 2 and 4 too much, because that would leave us with very few monosyllabic words. The interesting part about Weighted head marking is that if we settle for a medium value of about 30 monosyllabic words, it still gives us 1710 words for length 2 and 4, which is more than the 930 words that an optimal IM / TM yields for 30 monosyllabic words, so this is the solution we're going to choose.

| Mono syllables | 57… | 50… | 45… | 40… | 30… | 20… | 15… | 10… | 1 |
|---|---|---|---|---|---|---|---|---|---|
| Word-length 4 syllables | 1 | 8 | 13 | 18 | 28 | 38 | 43 | 48 | 57 |
| Word-length 6 syllables | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Word-length 8 syllables | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Words with length = 2 | 57 | 50 | 45 | 40 | 30 | 20 | 15 | 10 | 1 |
| Words with length = 4 | 60 | 480 | 780 | 1080 | 1680 | 2280 | 2580 | 2880 | 3420 |
| Words with length = 6 | 3600 | 3600 | 3600 | 3600 | 3600 | 3600 | 3600 | 3600 | 3600 |
| Words with length = 8 | 216000 | 216000 | 216000 | 216000 | 216000 | 216000 | 216000 | 216000 | 216000 |
| Sum of length 2 + 4 | 117 | 530 | 825 | 1120 | 1710 | 2300 | 2595 | 2890 | 3421 |
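The weighted variant can be recomputed with a short Python sketch. It assumes, as described above, that exactly one head syllable each is reserved for lengths 6 and 8 and that all remaining non-mono syllables mark length 4. The function name is illustrative.

```python
def whm_counts(mono: int, total: int = 60) -> dict[int, int]:
    """Word counts per length (in sounds) for weighted head marking."""
    len4_heads = total - mono - 2  # one head each is reserved for lengths 6 and 8
    if len4_heads < 1:
        raise ValueError("not enough syllables left to mark length 4")
    return {
        2: mono,                # every mono syllable is a word by itself
        4: len4_heads * total,  # head + 1 free syllable
        6: total ** 2,          # single head + 2 free syllables
        8: total ** 3,          # single head + 3 free syllables
    }


for mono in (57, 30, 10):
    counts = whm_counts(mono)
    print(mono, counts[2] + counts[4])
```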

Double self-segregating morphology (DSSM)

"E ana onoli oguma api o." (What a DSSM language may look like.)

Another word-length head marking approach that I'm going to discuss in this article is what I'll generalize as Double self-segregating morphology.

Considering that from a pragmatic point of view we're only really concerned with marking about 3 different word lengths, we can also use 3 of the 5 vowels for initial length marking, and continue with a CV pattern from there. The 3 vowels will be used to mark the lengths 3, 5 and 7, and the remaining 2 vowels can be used for really short, single-vowel words ("mono vowels").

The interesting aspect here is that this method actually provides redundant self-segregating morphology (hence the name DSSM): The initial vowel indicates the length of the word, plus, words begin and end with a vowel, so we'll get the benefit of the Vowel boundary (VB) method as a free bonus: If two vowels appear next to each other, we found a word boundary.

This method can be varied a little to use only one vowel for a single-vowel word, and use the freed vowel as an additional marker for word length 3, and we'll be adding both variations to the final comparison. Theoretically, as the following table shows, it is possible to use more than 2 vowels for mono vowel words, as shown in the columns for 3 and 4 mono vowels, but then (almost) no longer words would be available, so I'm not considering these options further.

| Mono vowels | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Word-length 3 vowels | 2 | 1 | 1 | 1 |
| Word-length 5 vowels | 1 | 1 | 1 | - |
| Word-length 7 vowels | 1 | 1 | - | - |
| Words with length = 1 | 1 | 2 | 3 | 4 |
| Words with length = 3 | 120 | 60 | 60 | 60 |
| Words with length = 5 | 3600 | 3600 | 3600 | - |
| Words with length = 7 | 216000 | 216000 | - | - |

Overall, this particular method is not efficient compared to other methods discussed here, but I'm leaving it in so I can show an example for double self-segregating morphology.

Just to demonstrate, the efficiency of this method could be improved by increasing the number of vowels in the inventory (as with other vowel-based methods), for example C=12, V=7, as shown in the following table:

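| Mono vowels | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Word-length 3 vowels | 4 | 3 | 2 | 1 | 1 | 1 |
| Word-length 5 vowels | 1 | 1 | 1 | 1 | 1 | - |
| Word-length 7 vowels | 1 | 1 | 1 | 1 | - | - |
| Words with length = 1 | 1 | 2 | 3 | 4 | 5 | 6 |
| Words with length = 3 | 336 | 252 | 168 | 84 | 84 | 84 |
| Words with length = 5 | 7056 | 7056 | 7056 | 7056 | 7056 | - |
| Words with length = 7 | 592704 | 592704 | 592704 | 592704 | - | - |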
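For readers who want to try other inventories, the DSSM counts can be sketched as follows. The sketch assumes the layout described above: some vowels are single-vowel words, some mark length 3, and exactly one vowel each marks lengths 5 and 7. The function name and its parameters are made up for this illustration.

```python
def dssm_counts(mono: int, len3_markers: int, c: int = 12, v: int = 5) -> dict[int, int]:
    """Word counts per length for DSSM with one marker vowel each for lengths 5 and 7."""
    if mono + len3_markers + 2 != v:
        raise ValueError("mono vowels and markers must use up the vowel inventory")
    cv = c * v  # number of CV syllables that can follow the marker vowel
    return {1: mono, 3: len3_markers * cv, 5: cv ** 2, 7: cv ** 3}


print(dssm_counts(1, 2))
print(dssm_counts(1, 4, v=7))
```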

Global termination (GT)

"Daken rimades pin los rumatin." (What a GT language may look like.)
"Agali adenti oruni i unedali." (What a GT language with vowel endings may look like.)

GT is simple to implement with a CV…C pattern, where all syllables are CV, but the last syllable in a word gets a final coda consonant, that cannot appear anywhere else in the word. This method can then be optimized by choosing the number of terminal coda consonants over regular initial and mid-word consonants, each forming one of two separate sets of consonants.

As the table shows, the results are straightforward: The optimal solution is 180 words of length 3. The number of words of length 3 and 5 could be optimized a bit further by settling with 160 words of length 3. As this GT implementation doesn't allow for words with length 4, I believe it is more important to optimize for words of length 3, though, and we'll choose this solution. Either way, there would be plenty of words available with length 5.

| Terminating consonants | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Regular consonants | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
| Words with length = 3 | 55 | 100 | 135 | 160 | 175 | 180 | 175 | 160 | 135 | 100 | 55 |
| Words with length = 5 | 3025 | 5000 | 6075 | 6400 | 6125 | 5400 | 4375 | 3200 | 2025 | 1000 | 275 |
| Words with length = 7 | 166375 | 250000 | 273375 | 256000 | 214375 | 162000 | 109375 | 64000 | 30375 | 10000 | 1375 |
| Words with length = 9 | 9150625 | 12500000 | 12301875 | 10240000 | 7503125 | 4860000 | 2734375 | 1280000 | 455625 | 100000 | 6875 |
| Sum of length 3+5 | 3080 | 5100 | 6210 | 6560 | 6300 | 5580 | 4550 | 3360 | 2160 | 1100 | 330 |
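The two candidate optima mentioned above (most words of length 3 versus most words of length 3 and 5 combined) can be found with a small search. This is a sketch under the CV…C assumption used for this table; `gt_count` is a name invented for the example.

```python
def gt_count(terminators: int, length: int, c: int = 12, v: int = 5) -> int:
    """Words of odd length >= 3: CV syllables from regular consonants plus one terminator."""
    regular = c - terminators
    return (regular * v) ** ((length - 1) // 2) * terminators


candidates = range(1, 12)
best_len3 = max(candidates, key=lambda t: gt_count(t, 3))
best_len3_5 = max(candidates, key=lambda t: gt_count(t, 3) + gt_count(t, 5))

print(best_len3, gt_count(best_len3, 3))
print(best_len3_5, gt_count(best_len3_5, 3) + gt_count(best_len3_5, 5))
```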

GT can also be implemented with a V…CV pattern, where all words start with a vowel, followed by zero or more CV syllables. One set contains the vowels that are only allowed as the final vowel of a word, and vowels from the set of remaining vowels are allowed anywhere in the word. The optimization strategy is to decide how many vowels are used as terminating vowels, and how many are allowed anywhere.

Like other vowel-based methods, this method is not particularly efficient, especially compared to the simpler Pure vowel boundaries (VB) method. One advantage of Global termination (GT) with vowels is that it does allow up to four single-vowel words, which is less than Pure vowel boundaries (VB) allows, though.

When actually optimizing for many single-vowel words, the number of words of medium length decreases quickly, to only 576 words of length 5 when allowing 4 single-vowel words. Therefore, I'm going to pick 2 terminating vowels as the optimal solution, which provides most words for length 1 and 3 and 5. I'm adding this method mainly to show how vowel-based methods compare to consonant-based methods.

| Terminating vowels | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Regular vowels | 4 | 3 | 2 | 1 |
| Words with length = 1 | 1 | 2 | 3 | 4 |
| Words with length = 3 | 48 | 72 | 72 | 48 |
| Words with length = 5 | 2304 | 2592 | 1728 | 576 |
| Words with length = 7 | 110592 | 93312 | 41472 | 6912 |
| Words with length = 9 | 5308416 | 3359232 | 995328 | 82944 |
| Sum of length 1+3+5 | 2353 | 2666 | 1803 | 628 |

As you may have noticed already, these implementations of Global termination would qualify as double self-segregating morphology as well. I'm reluctant to use them as an explicit example of DSSM here though, because the whole point of GT is that the syllable pattern could be anything (i.e. allowing mid-word diphthongs), and only the global terminal sounds make the difference.

Glue sounds (GS)

"Pi tensi kondunda me te lunsi lenta." (What a GS language may look like.)
"Daina de ma kaipo seiko soito duipa." (What another GS language may look like.)

To implement Glue sounds, a straightforward way is to use a CaV(Cb) syllable structure, where one or more consonants are defined as possible syllable coda, serving as the glue sounds, and the other consonants are only allowed at the beginning (onset) of a syllable. The coda consonant must be easily pronounceable regardless of which onset consonant is going to follow in the next syllable. A typical choice would be "n".

The Glue sounds method can be optimized by allowing more or fewer consonants to serve as glue sounds. What's special about the Glue sounds method is that it allows fairly short words of length 2 (CV), but the next available word length is only 5 (CVCCV). The method is not particularly efficient, because it always requires an extra sound. This makes it hard to choose an optimal solution.

Because Glue sounds is already quite efficient for words of length 5 with only one glue sound (providing 3025 words), I tend towards choosing one glue sound as the optimal solution, making Glue sounds one of the methods useful if your language needs a lot of really short (CV) words, similar to PHM discussed above. If that's for some reason preferable, it is also possible to use 4 glue sounds and thereby optimize for word length 5.

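| Number of glues Cb | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of onsets Ca | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
| Words with length = 2 | 55 | 50 | 45 | 40 | 35 | 30 | 25 | 20 | 15 | 10 | 5 |
| Words with length = 5 | 3025 | 5000 | 6075 | 6400 | 6125 | 5400 | 4375 | 3200 | 2025 | 1000 | 275 |
| Words with length = 8 | 166375 | 500000 | 820125 | 1024000 | 1071875 | 972000 | 765625 | 512000 | 273375 | 100000 | 15125 |
| Sum of length 2+5 | 3080 | 5050 | 6120 | 6440 | 6160 | 5430 | 4400 | 3220 | 2040 | 1010 | 280 |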
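The consonant-based Glue sounds method can be sketched in Python as a segmenter plus a counting function. The sketch assumes a single glue consonant "n" that never occurs as a syllable onset, and one letter per sound; both function names are hypothetical.

```python
GLUE = {"n"}  # assumed glue consonant(s); they occur only as syllable codas


def segment_gs(stream: str) -> list[str]:
    """Split a stream of CaV(Cb) syllables: a word continues only across a glue sound."""
    words, start, pos = [], 0, 0
    while pos < len(stream):
        pos += 2  # consume onset consonant and vowel
        if pos < len(stream) and stream[pos] in GLUE:
            pos += 1  # glue sound: another syllable of the same word follows
        else:
            words.append(stream[start:pos])
            start = pos
    if start != len(stream):
        raise ValueError("stream ends on a dangling glue sound")
    return words


def gs_count(glues: int, length: int, c: int = 12, v: int = 5) -> int:
    """Words of length 2, 5, 8, ...: k + 1 CaV syllables joined by k glue sounds."""
    if length < 2 or (length - 2) % 3 != 0:
        return 0
    k = (length - 2) // 3
    return ((c - glues) * v) ** (k + 1) * glues ** k


print(segment_gs("pitensikondundametelunsilenta"))
print(gs_count(1, 5), gs_count(5, 8))
```

The segmenter has to peek at the sound after each syllable before it can close a word, which reflects why this method is not self-terminating.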

The Glue sounds method can also be done with vowels, for example with a CVa(Vb) syllable pattern, and using "i" as glue, so that there will be four diphthongs "ai", "ei", "oi", and "ui" serving as glue and the remaining four single vowels "a", "e", "o", and "u" indicate the end of a word.

As shown in the following table, it is difficult to see an actual use case for using vowels as glue sounds aside from aesthetics. This method does not really perform well, because even the most efficient choice allows fewer words than the consonant-based method. For the final comparison, I'm choosing one glue vowel, which allows for 48 CV words – making it another one of the approaches useful if a language needs many single-syllable words.

| Number of glues Vb | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Number of nuclei Va | 4 | 3 | 2 | 1 |
| Words with length = 2 | 48 | 36 | 24 | 12 |
| Words with length = 5 | 2304 | 2592 | 1728 | 576 |
| Words with length = 8 | 110592 | 186624 | 124416 | 27648 |
| Sum of length 2+5 | 2352 | 2628 | 1752 | 588 |

Comparing SSM methods and word length

The following table shows the optimal solution for each method as discussed above, with the example phonetic inventory defined before (C=12, V=5). Methods marked with a star (*) are self-terminating. IM and TM are symmetrical; so they are shown in the same column. The best value for each word length is highlighted.

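| Method | VB | CB | IM / TM* | PHM* | WHM* | DSSM1* | DSSM2* | GT(C)* | GT(V)* | GS(C) | GS(V) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Shortest | 1 (V) | 3 (CVC) | 2 (CVa/CVb) | 2 (CVa) | 2 (CVa) | 1 (Va) | 1 (Va) | 3 (CaVCb) | 1 (Vb) | 2 (CaV) | 2 (CVa) |
| Length 1 | 5 | - | - | - | - | 1 | 2 | - | 2 | - | - |
| Length 2 | - | - | 30 | 57 | 30 | - | - | - | - | 55 | 48 |
| Length 3 | 300 | 720 | - | - | - | 120 | 60 | 180 | 72 | - | - |
| Length 4 | - | - | 900 | 60 | 1680 | - | - | - | - | - | - |
| Length 5 | 18000 | 43200 | - | - | - | 3600 | 3600 | 5400 | 2592 | 3025 | 2304 |
| Length 6 | - | - | 27000 | 3600 | 3600 | - | - | - | - | - | - |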

Additionally, the following table shows the total number of words available below a given word length threshold, i.e. the sum of all words shorter than or equal to the threshold length. Again, IM and TM are displayed together, and the best values in each category are highlighted:

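| Method | VB | CB | IM / TM* | PHM* | WHM* | DSSM1* | DSSM2* | GT(C)* | GT(V)* | GS(C) | GS(V) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Length ≤ 1 | 5 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 2 | 0 | 0 |
| Length ≤ 2 | 5 | 0 | 30 | 57 | 30 | 1 | 2 | 0 | 2 | 55 | 48 |
| Length ≤ 3 | 305 | 720 | 30 | 57 | 30 | 121 | 62 | 180 | 74 | 55 | 48 |
| Length ≤ 4 | 305 | 720 | 930 | 117 | 1710 | 121 | 62 | 180 | 74 | 55 | 48 |
| Length ≤ 5 | 18305 | 43920 | 930 | 117 | 1710 | 3721 | 3662 | 5580 | 2666 | 3080 | 2352 |
| Length ≤ 6 | 18305 | 43920 | 27930 | 3717 | 5310 | 3721 | 3662 | 5580 | 2666 | 3080 | 2352 |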

Note that the "best" values given for each method are the optimal values according to the discussion above, not necessarily the mathematical maximum. (You can always calculate the numerical maxima using the SSM calculator.)
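The cumulative totals are simply running sums of the per-length counts. The following Python sketch shows the idea for a subset of the methods, using the parameters chosen in the discussion above (30/30 syllables for IM / TM, 6 terminating consonants for GT, 1 glue consonant for GS). The dictionary layout is only one possible way to organize such a calculation.

```python
C, V = 12, 5
CV = C * V

# Per-length word counts for a few methods with the parameters chosen above.
METHODS = {
    "VB": lambda n: V * CV ** ((n - 1) // 2) if n % 2 == 1 else 0,
    "CB": lambda n: C * CV ** ((n - 1) // 2) if n % 2 == 1 and n >= 3 else 0,
    "IM/TM": lambda n: 30 * 30 ** ((n - 2) // 2) if n % 2 == 0 else 0,
    "GT(C)": lambda n: (6 * V) ** ((n - 1) // 2) * 6 if n % 2 == 1 and n >= 3 else 0,
    "GS(C)": lambda n: (11 * V) ** ((n - 2) // 3 + 1) if n >= 2 and (n - 2) % 3 == 0 else 0,
}


def cumulative(method: str, threshold: int) -> int:
    """Total number of words with length <= threshold for the given method."""
    count = METHODS[method]
    return sum(count(n) for n in range(1, threshold + 1))


for name in METHODS:
    print(name, [cumulative(name, t) for t in range(1, 7)])
```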

Overall, the most intuitive methods Pure vowel boundaries (VB) and Pure consonant boundaries (CB) are also the most efficient, if you can live with the fact that the minimum word length is usually 3 (or 1 if you allow single-vowel words for VB), and concepts of medium frequency will usually take at least 5 sounds to be expressed. Clearly, consonant boundaries are more efficient than vowel boundaries, because consonants are typically the larger group of sounds in the inventory.

The Initial marking (IM) and Terminal marking (TM) methods are more efficient with regard to shorter words (2-4 sounds), but on the other hand require at least 6 sounds for concepts of medium frequency due to their even number of sounds per word.

Pragmatic head marking (PHM) comes with a lot of length 2 words when parameterized for it, but cannot be optimized for words between 3 and 5 sounds as efficiently as IM / TM. Weighted head marking (WHM) on the other hand performs especially well for words of length 4 and shorter. Double self-segregating morphology (DSSM) needs about 1-2 more sounds than other methods to become as efficient as them, placing it somewhere in the middle between the other two redundant Global termination (GT) methods.

Global termination (GT) kind of combines the best of both worlds from the efficient "pure" methods (CB, VB) and the other self-terminating methods (TM, HM): It is self-terminating like TM and PHM/WHM, and also redundant like DSSM, and still relatively efficient at word length 3-5 compared to the "pure" methods CB and VB.

The Glue sounds (GS) methods behave a bit differently than the other methods: Unlike GT, they can already be quite efficient at word length 2, depending on the parameters, but they only provide a large number of long words from word length 8 onwards, whereas other methods are already more efficient at word length 7 or even 6. (In special cases with very high consonant to vowel ratio, for example C=18 and V=2, Glue sounds with consonants are actually the most efficient method for word length 8, although this is probably not relevant in practice.)

Formulas and parameters for each method

The following table lists the formulas and parameters used for the final comparison above:

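| Method | Formula | Constraint | Parameters |
|---|---|---|---|
| Pure vowel boundaries (VB) | $V \times (C \times V)^{(L-1)/2}$ | odd $L$ | - |
| Pure consonant boundaries (CB) | $C \times (V \times C)^{(L-1)/2}$ | odd $L \ge 3$ | - |
| Initial marking (IM) | $CV_a \times CV_b^{(L-2)/2}$ | even $L$ | a=30, b=30 |
| Terminal marking (TM) * | $CV_a^{(L-2)/2} \times CV_b$ | even $L$ | a=30, b=30 |
| Pragmatic head marking (PHM) * | $L=2$: $CV_a$; $L>2$: $\lfloor CV_b/3 \rfloor \times (C \times V)^{(L-2)/2}$ | $L \in \{2,4,6,8\}$ | a=57, b=3 |
| Weighted head marking (WHM) * | $L=2$: $CV_a$; $L=4$: $CV_b \times (C \times V)^{(L-2)/2}$; $L>4$: $(C \times V)^{(L-2)/2}$ | $L \in \{2,4,6,8\}$ | a=30, b=28 |
| Double self-segregating morphology (DSSM) * | $L=1$: $V_a$; $L=3$: $V_b \times (C \times V)^{(L-1)/2}$; $L>3$: $(C \times V)^{(L-1)/2}$ | $L \in \{1,3,5,7\}$ | a=1, b=2 (DSSM1); a=2, b=1 (DSSM2) |
| Global termination (GT) with consonants * | $(C_a \times V)^{(L-1)/2} \times C_b$ | odd $L \ge 3$ | a=6, b=6 |
| Global termination (GT) with vowels * | $(V_a \times C)^{(L-1)/2} \times V_b$ | odd $L$ | a=3, b=2 |
| Glue sounds (GS) with consonants | $(C_a \times V)^{(L-2)/3+1} \times C_b^{(L-2)/3}$ | $L = 2, 5, 8, \dots$ | a=11, b=1 |
| Glue sounds (GS) with vowels | $(C \times V_a)^{(L-2)/3+1} \times V_b^{(L-2)/3}$ | $L = 2, 5, 8, \dots$ | a=4, b=1 |

Here $L$ is the word length in sounds, and the subscripts a and b denote the number of syllables ($CV$), consonants ($C$), or vowels ($V$) assigned to group a or group b of the respective method.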

Methods marked with a star (*) are self-terminating.

You can also try these calculations yourself using the SSM calculator.

A note on alternatives to SSM

Before moving on to the final discussion, I would like to highlight that a strict SSM is not the only possible method to provide hints to the listener (or a computer) where words end:

A possible alternative to SSM is implementing vowel harmony (or consonant harmony) in a language. In short, vowel harmony involves dividing the vowel inventory into different classes, and allowing only vowels of the same class in the same word. Depending on how frequently vowels of each class occur, this provides a sort of "soft" or "fuzzy" self-segregating morphology which works for roughly 50 percent of word pairs:

If the vowel class changes between two syllables, we can be sure that they belong to different words and that there is a word boundary between them. However, if the vowels of two syllables belong to the same class, they could either actually belong to the same word, or already to the next word that coincidentally uses vowels from the same class as the previous word.
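The "roughly 50 percent" figure can be checked with a quick estimate. The sketch below assumes that the harmony classes of neighbouring words are statistically independent, which is a simplification and not something the article claims; the function name is invented for the example. Under that assumption, a boundary is detectable whenever two adjacent words fall into different classes.

```python
def boundary_detection_rate(class_frequencies: list[float]) -> float:
    """Share of word pairs whose boundary is audible as a change of harmony class."""
    total = sum(class_frequencies)
    probabilities = [f / total for f in class_frequencies]
    # Two adjacent words share a class with probability sum(p_i ** 2).
    return 1 - sum(p * p for p in probabilities)


print(boundary_detection_rate([1, 1]))  # two equally frequent classes
print(boundary_detection_rate([3, 1]))  # one class three times as frequent
```

With two equally frequent classes the rate is one half, and it drops as soon as one class dominates.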

In theory, such harmony could be implemented as "syllable harmony" as well – allowing only syllables from the same class in the same word. This is already more complex than a simple vowel harmony, so for practical purposes it would probably be easier to use one of the simple "real" self-segregating morphology methods instead.

Conclusion

The most interesting part about this comparison is the fact that all methods, despite following the same goal, yield very different results (leaving aside the obvious symmetry between some methods). The good news is that basically all methods can be used to generate a useful number of words, if they are implemented with a phonetic inventory of at least medium size and reasonably parameterized.

As a rule of thumb:

With all the calculations provided in the context of this article, keep in mind that the mere mathematical efficiency of a given SSM method is only a subset of the features worth considering. When choosing an SSM method, look at the whole picture:

While it is possible to tweak SSM methods further, it is not needed to yield a few thousand reasonably short words. Don't let this stop you from experimenting with new methods of SSM though. Happy conlanging!


Copyright © 2021 by Thomas Heller [ˈtoːmas ˈhɛlɐ]