Self-segregating morphology and Word length
A comparison of different self-segregating morphology (SSM) methods in an IAL context
by Thomas Heller
Contents
- What is SSM?
- Why is SSM relevant?
- How does SSM affect word length?
- A note on self-termination and syntax
- Example phonetic inventory
- Methods for SSM
- A note on inventory size
- Finding the optimal solutions
- Comparing SSM methods and word length
- Formulas and parameters for each method
- A note on alternatives to SSM
- Conclusion
What is SSM?
A language has a self-segregating morphology if it is engineered in such a way that any sequence of phonemes (and, if it has phonemic orthography, also any sequence of letters) can always be broken down into individual morphemes and/or words unambiguously.
Why is SSM relevant?
SSM has a number of theoretical advantages:
- A language with SSM is easier to learn, because even if you don't understand all the individual words yet, you will still understand the structure of an utterance and get less easily confused.
- A language with SSM is easier to parse for computers, especially when word boundaries must be detected in speech before it can be converted into written text for further processing.
- A language with SSM is faster to type, because you can omit spacing, and a computer can insert it for you automatically.
Because ease of learning and clarity are usually considered desirable properties of auxlangs and loglangs respectively, SSM is usually discussed in the context of such languages, and less so with regard to artlangs.
How does SSM affect word length?
Interestingly, another desirable property for auxlangs is a small phonetic inventory, to make them easy to pronounce for most humans on Earth. Both SSM and a small phonetic inventory reduce the number of possible syllables and syllable combinations. If the SSM method and phonetic inventory are badly chosen, the language will allow only a few short words, and words will tend to get long quickly.
Therefore, in this article, we will explore SSM methods that maximize the number of short words. To clarify: short words, in this context, are words with about 1-4 sounds that would typically be used for pronouns, conjunctions, syntactic particles etc., and very common noun/verb concepts. In contrast, medium words have about 5-6 sounds, and, for the purpose of this discussion, we consider words with 7, 8, and more sounds to be long words, used for more specific concepts, i.e. more technical terms.
A note on self-termination and syntax
As they will appear later in the discussion, I would like to mention three more concepts before I continue. I'm not aware of more common terms for these concepts in the context of constructed languages, so I'm proposing the following definitions:
- In general, a language is considered to have self-segregating morphology if, by looking at any words A and B of the language, it can be decided unambiguously where the boundary between those two words is, without additional semantic information. In addition to that, a language can be considered to have a self-terminating morphology if just by looking at the word A we know where this word ends (no more syllables will follow), without even considering word B.
- Extending the concept of SSM, a language can be considered to have self-segregating syntax if by looking at two sentences S1 and S2 we can tell the boundary between both sentences unambiguously just by looking at the morphology, without extra information like pauses in speech or visual punctuation. This can, for example, be achieved by using SOV word order and requiring all sentences to have at least a verb. This way, the verb becomes the syntactic separator between sentences.
- Now, combining both of these concepts, we can have a language that can be considered to have self-terminating syntax, which means that just by looking at a single sentence S1 we can tell where this sentence ends, without even looking at the next sentence S2. The advantage of self-terminating syntax for computer speech recognition is that there is no need to wait for a pause in speech to know that a request is complete. Formally, having self-terminating morphology is a necessary condition for having self-terminating syntax. So in practice, self-terminating syntax could be implemented with SOV word order plus any of the self-terminating morphology methods described below.
Example phonetic inventory
It is generally assumed that a phonetic inventory consists of two classes of sounds, consonants (C) and vowels (V). The number of consonants is usually assumed to be greater than the number of vowels. For the sake of this discussion, we will be using the following example phonetic inventory:
- Twelve consonants: p b t d k g m n s c v l
- Five vowels: a e i o u
Formally, this inventory can be described as C=12, V=5. Going by WALS' chapter 1 and chapter 2, the consonant inventory is small and the vowel inventory is average, respectively. After explaining the SSM methods discussed in this article, I will also go into a bit more detail about how inventory size affects the number of possible words.
Methods for SSM
The following methods for implementing SSM will be discussed in this article:
- Pure vowel boundaries (VB)
All words begin and end with a vowel (V…V – the shortest possible pattern is VCV, but it is possible to allow single-vowel words as well). Word boundaries are detected by looking for consecutive vowels. Diphthongs are not allowed, but mid-word consonant clusters are possible. Due to its simplicity, this method is probably the most "user-friendly" for human learners.
- Pure consonant boundaries (CB)
All words begin and end with a consonant (C…C – the shortest possible pattern is CVC, as single consonants are hard to pronounce for most speakers). Word boundaries are detected by looking for consonant clusters. Consonant clusters are not allowed mid-word, but mid-word diphthongs are possible. This method is similarly "user-friendly", but learners may need to put a bit of extra effort into always articulating final consonants clearly.
- Two sets of syllables: initial marking (IM)
Permissible syllables (CV) are divided into two groups, a and b. A word begins with exactly one syllable from group a; then any number of syllables from group b may follow (including zero). As a special case, to make them easier to distinguish, the syllables can be divided into groups in such a way that the groups differ only in the consonant part (onset) or only in the vowel part (nucleus). (For example, all syllables containing "a", "e", or "i" occur only at the beginning of a word, and all syllables containing "o" or "u" occur only in the middle or at the end of a word.) This method works particularly well for languages that rely mainly on prefixes for inflection.
- Two sets of syllables: terminal marking (TM)
Permissible syllables (CV) are divided into two groups, a and b. A word begins with zero or more syllables from group a; then exactly one syllable from group b must follow to end the word. Again, as a special case, the syllables can be divided into groups in such a way that the groups differ only in the consonant part (onset) or only in the vowel part (nucleus). (For example, all syllables containing "o" or "u" occur only at the beginning or in the middle of a word, and all syllables containing "a", "e", or "i" occur only at the end of a word.) This method works particularly well for languages that rely mainly on suffixes for inflection. The advantage of terminal marking over initial marking is that terminal marking results in a self-terminating morphology, which in turn allows for self-terminating syntax, as described above.
- Word-length head marking (HM)
The initial syllable (the "head") indicates to the listener how long the current word is. After the special initial syllable, any kind of syllable can follow, but the number of syllables in the word must match the amount specified by the special head syllable. This method is probably more oriented towards computer processing, rather than actual human speakers. (The Internet protocols IPv4 and IPv6 use a similar method of indicating "payload length" in a header data field, for example.) The advantage of word-length head marking is that it is self-terminating as well. In this article, we'll be discussing three subtypes of word-length head marking which I'll explain later.
- Global termination (GT)
Another relatively simple concept: Sounds are divided into two groups, a and b. Sounds of group a can occur everywhere, but words must end with a sound from group b. This method is also self-terminating, and easier to memorize than terminal marking. (It would also be possible to build a self-segregating morphology method that marks the beginning of each word with special sounds instead. This, however, would be symmetrical to Global termination and have the same statistical properties, so I'm leaving it out for brevity. It is also less interesting because it would not be self-terminating.)
- Glue sounds (GS)
This is kind of like the opposite of Global termination (GT), where a word continues until it is explicitly terminated. With Glue sounds (GS), special marking is required to indicate that a word continues: From the phonetic inventory, a few sounds are selected as "glue" sounds that bind syllables of the same word together. If there is no "glue" between two syllables, they break apart into separate words. This method is not self-terminating, because we need to wait to hear if another glue sound follows or not.
The VB and CB methods always result in an odd number of sounds per word, and the IM and TM methods implemented with CV syllables result in an even number of sounds per word. With the other methods, it depends on the exact implementation, as described below.
This list of seven basic self-segregating morphology methods is not exhaustive, but intended to cover the most common concepts. It is definitely possible to invent new methods for self-segregating morphology, and existing methods can also be combined. Because IALs usually demand a simple syllable structure like CV, more complex patterns are not really discussed in this article. On the other hand, some methods are simply variations of the methods described here, and have the same statistical properties, so it is safe to omit them from the discussion.
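To make the boundary rule of the two simplest methods concrete, here is a minimal Python sketch of a segmenter for VB and CB. It assumes a strictly phonemic spelling with one letter per sound and treats every letter that is not one of the five vowels as a consonant; the function name `segment_by_boundary` is made up for this illustration. The test streams are the VB and CB sample sentences from this article with spaces removed.

```python
VOWELS = set("aeiou")

def segment_by_boundary(stream: str, boundary_is_vowel: bool) -> list[str]:
    """Split an unspaced stream wherever two boundary-class sounds are adjacent.

    boundary_is_vowel=True  -> VB rule (vowel next to vowel marks a boundary)
    boundary_is_vowel=False -> CB rule (consonant next to consonant marks a boundary)
    """
    if not stream:
        return []
    words, start = [], 0
    for i in range(1, len(stream)):
        left = (stream[i - 1] in VOWELS) == boundary_is_vowel
        right = (stream[i] in VOWELS) == boundary_is_vowel
        if left and right:
            # Two boundary-class sounds in a row: the word ends before position i.
            words.append(stream[start:i])
            start = i
    words.append(stream[start:])
    return words

print(segment_by_boundary("aalitouminaenadoisailamo", True))
print(segment_by_boundary("rikdelemkanatromunsitdeningumalon", False))
```

Note that the segmenter has to see the first sound of the next word before it can close the current one, which is why neither VB nor CB is self-terminating.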
A note on inventory size
In this article we're looking at the efficiency of different SSM methods using a sample phonetic inventory defined above. You might be wondering, however, how inventory size affects the number of possible words. Here are a few basic observations:
- Obviously, a bigger phonetic inventory will always result in more available words per length, and therefore in shorter words. The interesting factor besides the total size of the inventory is the ratio of consonants to vowels.
- Vowel-based methods like Pure vowel boundaries (VB), Global termination (GT) with vowels as terminal sounds, and Glue sounds (GS) with vowels as glue sounds profit from a roughly 1:2 ratio of consonants to vowels, i.e. one third consonants and two thirds vowels. Having more vowels than consonants is not naturalistic, but regardless, getting closer to this optimum (for example by having the same amount of consonants and vowels) is better than having a lot of consonants for these methods.
- Likewise, consonant-based methods like Pure consonant boundaries (CB), Global termination (GT) with consonants as terminal sounds, and Glue sounds (GS) with consonants as glue sounds profit from a roughly 2:1 ratio of consonants to vowels, i.e. two thirds consonants and one third vowels.
- Methods based on CV syllables like Initial marking (IM) / Terminal marking (TM), and the head marking methods discussed below (PHM, WHM, DSSM) are most efficient with inventories where the number of consonants and vowels is equal.
- As a special case, if you are using a method that allows for words of length 1 (mono vowel words), like Pure vowel boundaries (VB), Double self-segregating morphology (DSSM) discussed below, or Global termination (GT) with vowels, and you want more of such words consisting of a single vowel, you will obviously need more vowels, regardless of the ratio.
For an IAL, I would recommend keeping the inventory small and simple, but for other types of projects, you can use this as a guideline to tweak the inventory if you intend to use a given SSM method. If you are further interested in the relationship between inventory size and number of available words, you can also go to the self-segregating morphology calculator and select the "Compare C/V ratio" option to see a visual representation.
Finding the optimal solutions
Pure vowel boundaries (VB)
"A alito umina enado isa ilamo!" (An example of what a VB language may look like.)
The efficiency of the VB method is already determined by choosing a phonetic inventory. It does not have any parameters that could be changed. In our example, we have 12 consonants and 5 vowels to work with. Just to showcase the possibility, we will allow single-vowel words, although the statistical impact is small (only 5 additional words).
Pure consonant boundaries (CB)
"Rik delem kanat romun, sit denin gumalon…" (What a CB language may look like.)
Likewise, the efficiency of the CB method is already determined by the phonetic inventory. It does not have any parameters that could be changed. Again, in our example, we have 12 consonants and 5 vowels, and we don't allow single-consonant words for ease of pronunciation for most speakers.
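Since neither method has a free parameter, the number of words per length follows directly from C and V. The sketch below assumes strictly alternating sounds (no mid-word clusters or diphthongs), which is the assumption behind the figures used in the comparison later in this article; the function names are invented for this example.

```python
def vb_count(length: int, c: int = 12, v: int = 5) -> int:
    """Words of pattern V(CV)^k, where length = 2k + 1. Even lengths yield 0."""
    if length < 1 or length % 2 == 0:
        return 0
    k = (length - 1) // 2
    return v * (c * v) ** k

def cb_count(length: int, c: int = 12, v: int = 5) -> int:
    """Words of pattern C(VC)^k, where length = 2k + 1 and k >= 1 (no single-consonant words)."""
    if length < 3 or length % 2 == 0:
        return 0
    k = (length - 1) // 2
    return c * (v * c) ** k

for n in (1, 3, 5):
    print(n, vb_count(n), cb_count(n))
```

With C=12 and V=5 this gives 5, 300, and 18000 words for VB and 0, 720, and 43200 words for CB at lengths 1, 3, and 5.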
Initial marking (IM)
"Mateni ramuno la dageki" (What an IM language may look like.)
The IM method, however, can be optimized along a single axis. Syllables are divided into two groups, and we can choose the number of syllables in the first group (which will determine the number of syllables in the second group).
In our example, we have 12×5 = 60 syllables following a CV pattern. Of these 60 syllables, between 1 and 59 (both inclusive) can be put in the first group, resulting in 59 to 1 (inclusive) complementary syllables in the second group.
The following table lists 9 selected examples of the 59 possible choices, highlighting the best possible values for each word length:
Set a cardinality | 1 | 2… | 15… | 20… | 30… | 40… | 45… | 58 | 59 |
---|---|---|---|---|---|---|---|---|---|
Set b cardinality | 59 | 58 | 45 | 40 | 30 | 20 | 15 | 2 | 1 |
Words with length = 2 | 1 | 2 | 15 | 20 | 30 | 40 | 45 | 58 | 59 |
Words with length = 4 | 59 | 116 | 675 | 800 | 900 | 800 | 675 | 116 | 59 |
Words with length = 6 | 3481 | 6728 | 30375 | 32000 | 27000 | 16000 | 10125 | 232 | 59 |
Words with length = 8 | 205379 | 390224 | 1366875 | 1280000 | 810000 | 320000 | 151875 | 464 | 59 |
Sum of length 2 + 4 | 60 | 118 | 690 | 820 | 930 | 840 | 720 | 174 | 118 |
The general pattern is that for short words, it is better to have more syllables in the initial set a, and for longer words it is better to have more syllables in the following set b. A special case is having only one syllable in set b, which would result in a funny lexicon like for example "ma", "maba", "mababa" … "te", "teba", "tebaba" etc. where a single syllable, for example "ba", would repeat in all words longer than one syllable.
It becomes clear that over-optimizing the shortest words is not the most efficient solution overall as that makes it hard to form medium-sized words. On the other hand, there is no need to optimize for long words beyond a certain threshold. The majority of words in a language are seldom used anyway. For example, about 1000 words of the English language already make up about 75% of regularly used words (source). For practical purposes, as soon as a few thousand concepts can be expressed with reasonably short to medium sized words, a morphology is probably sufficiently optimized.
In conclusion, for our calculation we will pick an initial set a size of 30, which is the optimal solution when considering words of length 2 and 4 together, resulting in 930 possible short words. If your language requires a particularly high number of single-syllable words, it would be perfectly reasonable to choose an initial set a size of about 40 or even 45, too, which still yields 840 or 720 words of length 2 and 4 combined, respectively.
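The optimization described above can be reproduced with a short brute-force search. This is only a sketch: the helper names `im_count` and `short_words` are invented here, and the 60 syllables are the CV syllables of the example inventory.

```python
def im_count(set_a: int, syllables: int, total: int = 60) -> int:
    """Words with one initial syllable from set a, then (syllables - 1) from set b."""
    set_b = total - set_a
    return set_a * set_b ** (syllables - 1)

def short_words(set_a: int, total: int = 60) -> int:
    """Words of one or two syllables, i.e. length 2 plus length 4."""
    return im_count(set_a, 1, total) + im_count(set_a, 2, total)

# Try every split from 1 to 59 syllables in set a.
best = max(range(1, 60), key=short_words)
print(best, short_words(best))  # 30 930
```

A set a size of 31 gives the same total of 930, so the optimum is a tie; `max` simply reports the first of the two.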
Terminal marking (TM)
"Gomuda relika sa tena godona." (What a TM language may look like.)
The TM method is optimized exactly the same way as the IM method: Along a single axis, we can put 1 to 59 of the 60 CV syllables in the first group, and the others in the second group.
The table for terminal marking is symmetrical to the table for initial marking, so it is shown here just for completeness' sake:
Set a cardinality | 1 | 2… | 15… | 20… | 30… | 40… | 45… | 58 | 59 |
---|---|---|---|---|---|---|---|---|---|
Set b cardinality | 59 | 58 | 45 | 40 | 30 | 20 | 15 | 2 | 1 |
Words with length = 2 | 59 | 58 | 45 | 40 | 30 | 20 | 15 | 2 | 1 |
Words with length = 4 | 59 | 116 | 675 | 800 | 900 | 800 | 675 | 116 | 59 |
Words with length = 6 | 59 | 232 | 10125 | 16000 | 27000 | 32000 | 30375 | 6728 | 3481 |
Words with length = 8 | 59 | 464 | 151875 | 320000 | 810000 | 1280000 | 1366875 | 390224 | 205379 |
Sum of length 2 + 4 | 118 | 174 | 720 | 840 | 930 | 820 | 690 | 118 | 60 |
Using the same optimization strategy as for initial marking above, we will get 930 possible short words for terminal marking as well.
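To illustrate why terminal marking is self-terminating, here is a minimal segmenter sketch. It assumes the vowel-based split given as an example in the method list (syllables with "o" or "u" are non-final, syllables with "a", "e", or "i" end a word) and a stream of well-formed CV syllables. The test words are made up for this illustration and are not taken from an actual lexicon.

```python
FINAL_VOWELS = set("aei")  # assumption: the example split from the method description

def segment_tm(stream: str) -> list[str]:
    """Close a word as soon as a syllable from the terminal group is read."""
    words, current = [], ""
    for i in range(0, len(stream), 2):
        syllable = stream[i:i + 2]  # one CV syllable
        current += syllable
        if syllable[-1] in FINAL_VOWELS:
            # Terminal syllable: the word is complete without any lookahead.
            words.append(current)
            current = ""
    if current:
        raise ValueError("stream ends in the middle of a word")
    return words

print(segment_tm("gomudasatunagodone"))
```

Unlike a VB or CB segmenter, this one never needs to inspect the following word, so a listener or speech recognizer knows a word is finished the moment its last syllable is heard.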
Word-length head marking (HM)
There are several ways to implement word-length head marking. One main difference is that some methods allow for fixed word lengths up to a specific maximum, while other methods allow for theoretically infinitely long words. In this article, I'm going to focus on three methods that support finite word lengths, because we're mainly interested in short words.
Pragmatic head marking (PHM)
"Filo ne fire, go fadure fanedi." (What a PHM language may look like, when including f.)
What I'm calling Pragmatic head marking here is an approach that takes advantage of the fact that really long words are usually not required anyway – if needed, more complex concepts can be expressed by compounding.
Therefore, of the 60 CV syllables available with our example inventory, we'll be using only 3 syllables to mark words of reasonable lengths, namely 4, 6, and 8 sounds. The remaining 57 syllables will all be used for short, single-syllable words. This way, we can use pragmatic head marking for languages where we would like to have as many monosyllabic words as possible.
In theory, pragmatic head marking could be optimized for the maximum number of words with length 2 and 4, like IM / TM, by using fewer syllables for monosyllabic words and distributing the remaining syllables evenly among the reasonable word lengths longer than one syllable that we would like to mark.
However, as the following table shows, this path of optimization would yield 1143 words, which is not significantly more than the 930 words provided by an optimal IM / TM, and would lead to a quirky language with only 3 monosyllabic words. I'm not sure if there is a real use case for this, so we'll pick the interesting solution with 57 monosyllabic words for now.
Mono syllables | 57… | 54… | 45… | 40… | 30… | 20… | 15… | 6… | 3 |
---|---|---|---|---|---|---|---|---|---|
Word-length syllables | 3 | 6 | 15 | 20 | 30 | 40 | 45 | 54 | 57 |
WL syllables per length | 1 | 2 | 5 | 6 | 10 | 13 | 15 | 18 | 19 |
Words with length = 2 | 57 | 54 | 45 | 40 | 30 | 20 | 15 | 6 | 3 |
Words with length = 4 | 60 | 120 | 300 | 360 | 600 | 780 | 900 | 1080 | 1140 |
Words with length = 6 | 3600 | 7200 | 18000 | 21600 | 36000 | 46800 | 54000 | 64800 | 68400 |
Words with length = 8 | 216000 | 432000 | 1080000 | 1296000 | 2160000 | 2808000 | 3240000 | 3888000 | 4104000 |
Sum of length 2 + 4 | 117 | 174 | 345 | 400 | 630 | 800 | 915 | 1086 | 1143 |
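
As a rough illustration of how a listener or program would decode pragmatic head marking, the sketch below segments the sample sentence from the beginning of this section. The head assignments are assumptions: "fi" for two-syllable words and "fa" for three-syllable words are inferred from that sample sentence, and "fu" for four-syllable words is purely hypothetical. Every other CV syllable is treated as a monosyllabic word.

```python
# Assumed head syllables -> total number of syllables in the word (head included).
HEAD_LENGTHS = {"fi": 2, "fa": 3, "fu": 4}

def segment_phm(stream: str) -> list[str]:
    """Read a head syllable, then consume exactly as many syllables as it announces."""
    words, i = [], 0
    while i < len(stream):
        head = stream[i:i + 2]
        n_syllables = HEAD_LENGTHS.get(head, 1)  # non-head syllables are whole words
        end = i + 2 * n_syllables
        if end > len(stream):
            raise ValueError("stream ends before the announced word length is reached")
        words.append(stream[i:end])
        i = end
    return words

print(segment_phm("filonefiregofadurefanedi"))
```

Because the length is known from the head, the syllables after it are unrestricted; a head syllable may even reappear inside a word without causing ambiguity.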
Weighted head marking (WHM)
A way to make word-length head marking more efficient is by weighting the number of word-length indicating syllables by word length. In the table above, you could see in the "WL syllables per length" row that with the equal distribution of remaining syllables among word lengths 4, 6, and 8, we were always using exactly one third of the syllables for each remaining word length.
So, I'm proposing a radical approach here – to use just one syllable for lengths 6 and 8 (as with the PHM method chosen above), as this allows for generating plenty of long words already anyway, and parametrizing only the number of syllables used to mark lengths 2 and 4. (If desired, a few more syllables could be used for word length 6 instead of 2 or 4, of course.)
As the following table shows, with this method, it doesn't make sense to maximize the words of length 2 and 4 too much, because that would leave us with very few monosyllabic words. The interesting part about Weighted head marking is that if we settle for a medium value of about 30 monosyllabic words, it still gives us 1710 words for length 2 and 4, which is more than the 930 words that an optimal IM / TM yields for 30 monosyllabic words, so this is the solution we're going to choose.
Mono syllables | 57… | 50… | 45… | 40… | 30… | 20… | 15… | 10… | 1 |
---|---|---|---|---|---|---|---|---|---|
Word-length 4 syllables | 1 | 8 | 13 | 18 | 28 | 38 | 43 | 48 | 57 |
Word-length 6 syllables | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Word-length 8 syllables | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Words with length = 2 | 57 | 50 | 45 | 40 | 30 | 20 | 15 | 10 | 1 |
Words with length = 4 | 60 | 480 | 780 | 1080 | 1680 | 2280 | 2580 | 2880 | 3420 |
Words with length = 6 | 3600 | 3600 | 3600 | 3600 | 3600 | 3600 | 3600 | 3600 | 3600 |
Words with length = 8 | 216000 | 216000 | 216000 | 216000 | 216000 | 216000 | 216000 | 216000 | 216000 |
Sum of length 2 + 4 | 117 | 530 | 825 | 1120 | 1710 | 2300 | 2595 | 2890 | 3421 |
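
The rows of this table follow from a simple rule, sketched below under the stated setup: one head syllable each for lengths 6 and 8, and all remaining non-monosyllabic syllables used as heads for length 4. The function name `whm_counts` is invented for this example.

```python
def whm_counts(mono: int, total: int = 60, fixed_heads: int = 2) -> dict[int, int]:
    """Word counts per length (in sounds) for a given number of monosyllabic words."""
    heads4 = total - mono - fixed_heads  # syllables left over to mark length 4
    if mono < 0 or heads4 < 0:
        raise ValueError("not enough syllables for this split")
    return {
        2: mono,             # each monosyllable is a word of its own
        4: heads4 * total,   # head + 1 free syllable
        6: total ** 2,       # single head + 2 free syllables
        8: total ** 3,       # single head + 3 free syllables
    }

counts = whm_counts(30)
print(counts, counts[2] + counts[4])
```

For 30 monosyllabic words this gives 1680 words of length 4 and a combined 1710 words of length 2 and 4.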
Double self-segregating morphology (DSSM)
"E ana onoli oguma api o." (What a DSSM language may look like.)
Another word-length head marking approach that I'm going to discuss in this article is what I'll generalize as Double self-segregating morphology.
Considering that from a pragmatic point of view we're only really concerned with marking about 3 different word lengths, we can also use 3 of the 5 vowels for initial length marking, and continue with a CV pattern from there. The 3 vowels will be used to mark the lengths 3, 5 and 7, and the remaining 2 vowels can be used for really short, single-vowel words ("mono vowels").
The interesting aspect here is that this method actually provides redundant self-segregating morphology (hence the name DSSM): The initial vowel indicates the length of the word, plus, words begin and end with a vowel, so we'll get the benefit of the Vowel boundary (VB) method as a free bonus: If two vowels appear next to each other, we found a word boundary.
This method can be varied a little to use only one vowel for a single-vowel word, and use the freed vowel as an additional marker for word length 3, and we'll be adding both variations to the final comparison. Theoretically, as the following table shows, it is possible to use more than 2 vowels for mono vowel words, as shown in the columns for 3 and 4 mono vowels, but then (almost) no longer words would be available, so I'm not considering these options further.
Mono vowels | 1 | 2 | 3 | 4 |
---|---|---|---|---|
Word-length 3 vowels | 2 | 1 | 1 | 1 |
Word-length 5 vowels | 1 | 1 | 1 | - |
Word-length 7 vowels | 1 | 1 | - | - |
Words with length = 1 | 1 | 2 | 3 | 4 |
Words with length = 3 | 120 | 60 | 60 | 60 |
Words with length = 5 | 3600 | 3600 | 3600 | - |
Words with length = 7 | 216000 | 216000 | - | - |
Overall, this particular method is not efficient compared to other methods discussed here, but I'm leaving it in so I can show an example for double self-segregating morphology.
Just to demonstrate, the efficiency of this method could be improved by increasing the number of vowels in the inventory (as with other vowel-based methods), for example C=12, V=7, as shown in the following table:
Mono vowels | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
Word-length 3 vowels | 4 | 3 | 2 | 1 | 1 | 1 |
Word-length 5 vowels | 1 | 1 | 1 | 1 | 1 | - |
Word-length 7 vowels | 1 | 1 | 1 | 1 | - | - |
Words with length = 1 | 1 | 2 | 3 | 4 | 5 | 6 |
Words with length = 3 | 336 | 252 | 168 | 84 | 84 | 84 |
Words with length = 5 | 7056 | 7056 | 7056 | 7056 | 7056 | - |
Words with length = 7 | 592704 | 592704 | 592704 | 592704 | - | - |
Global termination (GT)
"Daken rimades pin los rumatin." (What a GT language may look like.)
"Agali adenti oruni i unedali." (What a GT language with vowel endings may look like.)
GT is simple to implement with a CV…C pattern, where all syllables are CV, but the last syllable in a word gets a final coda consonant which cannot appear anywhere else in the word. This method can then be optimized by choosing the number of terminal coda consonants over regular initial and mid-word consonants, each forming one of two separate sets of consonants.
As the table shows, the results are straightforward: The optimal solution is 180 words of length 3. The number of words of length 3 and 5 could be optimized a bit further by settling for 160 words of length 3. As this GT implementation doesn't allow for words with length 4, I believe it is more important to optimize for words of length 3, though, and we'll choose this solution. Either way, there would be plenty of words available with length 5.
Terminating consonants | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
---|---|---|---|---|---|---|---|---|---|---|---|
Regular consonants | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
Words with length = 3 | 55 | 100 | 135 | 160 | 175 | 180 | 175 | 160 | 135 | 100 | 55 |
Words with length = 5 | 3025 | 5000 | 6075 | 6400 | 6125 | 5400 | 4375 | 3200 | 2025 | 1000 | 275 |
Words with length = 7 | 166375 | 250000 | 273375 | 256000 | 214375 | 162000 | 109375 | 64000 | 30375 | 10000 | 1375 |
Words with length = 9 | 9150625 | 12500000 | 12301875 | 10240000 | 7503125 | 4860000 | 2734375 | 1280000 | 455625 | 100000 | 6875 |
Sum of length 3+5 | 3080 | 5100 | 6210 | 6560 | 6300 | 5580 | 4550 | 3360 | 2160 | 1100 | 330 |
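
The two candidate optima mentioned above can be found with a small search. This is a sketch for the CV…C pattern only, with an invented function name; it assumes that terminating consonants never appear as onsets.

```python
def gt_c_count(terminating: int, length: int, c: int = 12, v: int = 5) -> int:
    """k CV syllables with regular onsets plus one terminating coda; length = 2k + 1."""
    if length < 3 or length % 2 == 0:
        return 0
    k = (length - 1) // 2
    regular = c - terminating
    return (regular * v) ** k * terminating

splits = range(1, 12)  # 1 to 11 terminating consonants
best_len3 = max(splits, key=lambda t: gt_c_count(t, 3))
best_len3_5 = max(splits, key=lambda t: gt_c_count(t, 3) + gt_c_count(t, 5))
print(best_len3, gt_c_count(best_len3, 3))  # 6 180
print(best_len3_5, gt_c_count(best_len3_5, 3) + gt_c_count(best_len3_5, 5))  # 4 6560
```

The search confirms the trade-off: six terminating consonants maximize words of length 3, while four maximize lengths 3 and 5 combined.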
GT can also be implemented with a V…CV pattern, where all words start with a vowel, followed by zero or more CV syllables. One set contains the vowels that are only allowed as the final vowel of a word, and vowels from the set of remaining vowels are allowed anywhere in the word. The optimization strategy is to decide how many vowels are used as terminating vowels, and how many are allowed anywhere.
Like other vowel-based methods, this method is not particularly efficient, especially compared to the simpler Pure vowel boundaries (VB) method. Global termination (GT) with vowels does allow up to four single-vowel words, though that is still fewer than the five that Pure vowel boundaries (VB) allows.
When actually optimizing for many single-vowel words, the number of words of medium length decreases quickly, to only 576 words of length 5 when allowing 4 single-vowel words. Therefore, I'm going to pick 2 terminating vowels as the optimal solution, which provides most words for length 1 and 3 and 5. I'm adding this method mainly to show how vowel-based methods compare to consonant-based methods.
Terminating vowels | 1 | 2 | 3 | 4 |
---|---|---|---|---|
Regular vowels | 4 | 3 | 2 | 1 |
Words with length = 1 | 1 | 2 | 3 | 4 |
Words with length = 3 | 48 | 72 | 72 | 48 |
Words with length = 5 | 2304 | 2592 | 1728 | 576 |
Words with length = 7 | 110592 | 93312 | 41472 | 6912 |
Words with length = 9 | 5308416 | 3359232 | 995328 | 82944 |
Sum of length 1+3+5 | 2353 | 2666 | 1803 | 628 |
As you may have noticed already, these implementations of Global termination would qualify as double self-segregating morphology as well. I'm reluctant to use them as an explicit example of DSSM here though, because the whole point of GT is that the syllable pattern could be anything (i.e. allowing mid-word diphthongs), and only the global terminal sounds make the difference.
Glue sounds (GS)
"Pi tensi kondunda me te lunsi lenta." (What a GS language may look like.)
"Daina de ma kaipo seiko soito duipa." (What another GS language may look like.)
To implement Glue sounds, a straightforward way is to use a CaV(Cb) syllable structure, where one or more consonants are defined as possible syllable coda, serving as the glue sounds, and the other consonants are only allowed at the beginning (onset) of a syllable. The coda consonant must be easily pronounceable regardless of which onset consonant is going to follow in the next syllable. A typical choice would be "n".
The Glue sounds method can be optimized by allowing more or fewer consonants to serve as glue sounds. What's special about the Glue sounds method is that it allows fairly short words of length 2 (CV), but the next available word length is only 5 (CVCCV). The method is not particularly efficient, because it always requires an extra sound. This makes it hard to choose an optimal solution.
Because Glue sounds is already quite efficient for words of length 5 with only one glue sound (providing 3025 words), I tend towards choosing one glue sound as the optimal solution, making Glue sounds one of the methods useful if your language needs a lot of really short (CV) words, similar to PHM discussed above. If that's for some reason preferable, it is also possible to use 4 glue sounds and thereby optimize for word length 5.
Number of glues Cb | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
---|---|---|---|---|---|---|---|---|---|---|---|
Number of onsets Ca | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
Words with length = 2 | 55 | 50 | 45 | 40 | 35 | 30 | 25 | 20 | 15 | 10 | 5 |
Words with length = 5 | 3025 | 5000 | 6075 | 6400 | 6125 | 5400 | 4375 | 3200 | 2025 | 1000 | 275 |
Words with length = 8 | 166375 | 500000 | 820125 | 1024000 | 1071875 | 972000 | 765625 | 512000 | 273375 | 100000 | 15125 |
Sum of length 2+5 | 3080 | 5050 | 6120 | 6440 | 6160 | 5430 | 4400 | 3220 | 2040 | 1010 | 280 |
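
The consonant-based variant can be decoded with one sound of lookahead, as the following sketch shows. It assumes "n" as the only glue consonant (the typical choice named above), a well-formed stream of CaV(Cb) syllables, and that the glue consonant never occurs as an onset. The test stream is the first GS sample sentence with spaces removed.

```python
GLUE = set("n")  # assumption: a single glue consonant

def segment_gs(stream: str) -> list[str]:
    """After each CV syllable, a glue sound continues the word; anything else ends it."""
    words, current, i = [], "", 0
    while i < len(stream):
        current += stream[i:i + 2]  # onset + vowel
        i += 2
        if i < len(stream) and stream[i] in GLUE:
            current += stream[i]    # glue coda: the next syllable belongs to this word
            i += 1
        else:
            words.append(current)   # no glue: the word is complete
            current = ""
    if current:
        raise ValueError("stream ends on a glue sound")
    return words

print(segment_gs("pitensikondundametelunsilenta"))
```

The lookahead in the `if` branch is the reason the method is not self-terminating: the end of a word is only known once the sound after it turns out not to be glue.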
The Glue sounds method can also be done with vowels, for example with a CVa(Vb) syllable pattern, and using "i" as glue, so that there will be four diphthongs "ai", "ei", "oi", and "ui" serving as glue and the remaining four single vowels "a", "e", "o", and "u" indicate the end of a word.
As shown in the following table, it is difficult to see an actual use case for using vowels as glue sounds aside from aesthetics. This method does not really perform well, because even the most efficient choice allows fewer words than the consonant-based method. For the final comparison, I'm choosing one glue vowel, which allows for 48 CV words – making it another one of the approaches useful if a language needs many single-syllable words.
Number of glues Vb | 1 | 2 | 3 | 4 |
---|---|---|---|---|
Number of nuclei Va | 4 | 3 | 2 | 1 |
Words with length = 2 | 48 | 36 | 24 | 12 |
Words with length = 5 | 2304 | 2592 | 1728 | 576 |
Words with length = 8 | 110592 | 186624 | 124416 | 27648 |
Sum of length 2+5 | 2352 | 2628 | 1752 | 588 |
Comparing SSM methods and word length
The following table shows the optimal solution for each method as discussed above, with the example phonetic inventory defined before (C=12, V=5). Methods marked with a star (*) are self-terminating. IM and TM are symmetrical; so they are shown in the same column. The best value for each word length is highlighted.
Method | VB | CB | IM / TM* | PHM* | WHM* | DSSM1* | DSSM2* | GT(C)* | GT(V)* | GS(C) | GS(V) |
---|---|---|---|---|---|---|---|---|---|---|---|
Shortest | 1 (V) | 3 (CVC) | 2 (CVa/CVb) | 2 (CVa) | 2 (CVa) | 1 (Va) | 1 (Va) | 3 (CVaCB) | 1 (Vb) | 2 (CaV) | 2 (CaV) |
Length 1 | 5 | - | - | - | - | 1 | 2 | - | 2 | - | - |
Length 2 | - | - | 30 | 57 | 30 | - | - | - | - | 55 | 48 |
Length 3 | 300 | 720 | - | - | - | 120 | 60 | 180 | 72 | - | - |
Length 4 | - | - | 900 | 60 | 1680 | - | - | - | - | - | - |
Length 5 | 18000 | 43200 | - | - | - | 3600 | 3600 | 5400 | 2592 | 3025 | 2304 |
Length 6 | - | - | 32000 | 3600 | 3600 | - | - | - | - | - | - |
Additionally, the following table shows the total number of words available up to a given word length threshold, i.e. the sum of all words shorter than or equal to the threshold length. Again, IM and TM are displayed together, and the best values in each category are highlighted:
Method | VB | CB | IM / TM* | PHM* | WHM* | DSSM1* | DSSM2* | GT(C)* | GT(V)* | GS(C) | GS(V) |
---|---|---|---|---|---|---|---|---|---|---|---|
Length ≤ 1 | 5 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 2 | 0 | 0 |
Length ≤ 2 | 5 | 0 | 30 | 57 | 30 | 1 | 2 | 0 | 2 | 55 | 48 |
Length ≤ 3 | 305 | 720 | 30 | 57 | 30 | 121 | 62 | 180 | 74 | 55 | 48 |
Length ≤ 4 | 305 | 720 | 930 | 117 | 1710 | 121 | 62 | 180 | 74 | 55 | 48 |
Length ≤ 5 | 18305 | 43920 | 930 | 117 | 1710 | 3721 | 3662 | 5580 | 2666 | 3080 | 2352 |
Length ≤ 6 | 18305 | 43920 | 32930 | 3717 | 5310 | 3721 | 3662 | 5580 | 2666 | 3080 | 2352 |
Note that the "best" values given for each method are the optimal values according to the discussion above, not necessarily the mathematical maximum. (You can always calculate the numerical maxima using the SSM calculator.)
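The cumulative table is simply a running sum over the per-length table. A minimal sketch in Python, using a few columns copied from the per-length table (a dash is treated as 0; the dictionary and the helper name `cumulative` are only for illustration):

```python
from itertools import accumulate

# Words per length 1..6, copied from the per-length table (dash = 0).
per_length = {
    "VB":    [5, 0, 300, 0, 18000, 0],
    "CB":    [0, 0, 720, 0, 43200, 0],
    "WHM":   [0, 30, 0, 1680, 0, 3600],
    "GT(C)": [0, 0, 180, 0, 5400, 0],
}


def cumulative(counts):
    """Return the number of words with length <= n for n = 1, 2, ..."""
    return list(accumulate(counts))


for method, counts in per_length.items():
    print(method, cumulative(counts))
```

The same helper works for any other column once its per-length counts are filled in.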
Overall, the most intuitive methods Pure vowel boundaries (VB) and Pure consonant boundaries (CB) are also the most efficient, if you can live with the fact that the minimum word length is usually 3 (or 1 if you allow single-vowel words for VB), and concepts of medium frequency will usually take at least 5 sounds to be expressed. Clearly, consonant boundaries are more efficient than vowel boundaries, because consonants are typically the larger group of sounds in the inventory.
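The size of CB's advantage follows directly from the VB and CB formulas: the syllable factor cancels, leaving only the ratio of consonants to vowels. As a worked example with the example inventory (C=12, V=5):

$$\frac{C \times (V \times C)^{(L-1)/2}}{V \times (C \times V)^{(L-1)/2}} = \frac{C}{V} = \frac{12}{5} = 2.4$$

This matches the table above at every odd length shared by both methods, for example 720 / 300 at length 3 and 43200 / 18000 at length 5.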
The Initial marking (IM) and Terminal marking (TM) methods are more efficient for shorter words (2-4 sounds), but require at least 6 sounds for concepts of medium frequency because their words always have an even number of sounds.
Pragmatic head marking (PHM) comes with a lot of length 2 words when parameterized for it, but cannot be optimized for words between 3 and 5 sounds as efficiently as IM / TM. Weighted head marking (WHM) on the other hand performs especially well for words of length 4 and shorter. Double self-segregating morphology (DSSM) needs about 1-2 more sounds than other methods to become as efficient as them, placing it somewhere in the middle between the other two redundant Global termination (GT) methods.
Global termination (GT) combines the best of both worlds, the efficient "pure" methods (CB, VB) and the other self-terminating methods (TM, HM): it is self-terminating like TM and PHM/WHM, redundant like DSSM, and still relatively efficient at word lengths 3-5 compared to the "pure" methods CB and VB.
The Glue sounds (GS) methods behave a bit differently from the other methods: unlike GT, they can already be quite efficient at word length 2, depending on the parameters, but they need word length 8 to produce a large number of long words, whereas other methods are already more efficient at word length 7 or even 6. (In special cases with a very high consonant-to-vowel ratio, for example C=18 and V=2, Glue sounds with consonants is actually the most efficient method for word length 8, although this is probably not relevant in practice.)
Formulas and parameters for each method
The following table lists the formulas and parameters used for the final comparison above:
Method | Formula | Constraint | Parameters |
---|---|---|---|
Pure vowel boundaries (VB) | $V \times (C \times V)^{(L-1)/2}$ | odd L | - |
Pure consonant boundaries (CB) | $C \times (V \times C)^{(L-1)/2}$ | odd L ≥ 3 | - |
Initial marking (IM) | $CV_a \times CV_b^{(L-2)/2}$ | even L | a=30, b=20 |
Terminal marking (TM) * | $CV_a^{(L-2)/2} \times CV_b$ | even L | a=20, b=30 |
Pragmatic head marking (PHM) * | L=2: $CV_a$; L>2: $\lfloor CV_b/3 \rfloor \times (C \times V)^{(L-2)/2}$ | L∈{2,4,6,8} | a=57, b=3 |
Weighted head marking (WHM) * | L=2: $CV_a$; L=4: $CV_b \times (C \times V)^{(L-2)/2}$; L>4: $(C \times V)^{(L-2)/2}$ | L∈{2,4,6,8} | a=30, b=28 |
Double self-segregating morphology (DSSM) * | L=1: $V_a$; L=3: $V_b \times (C \times V)^{(L-1)/2}$; L>3: $(C \times V)^{(L-1)/2}$ | L∈{1,3,5,7} | DSSM1: a=1, b=2; DSSM2: a=2, b=1 |
Global termination (GT) with consonants * | $(C_a \times V)^{(L-1)/2} \times C_b$ | odd L ≥ 3 | a=6, b=6 |
Global termination (GT) with vowels * | $(V_a \times C)^{(L-1)/2} \times V_b$ | odd L | a=3, b=2 |
Glue sounds (GS) with consonants | $(C_a \times V)^{(L-2)/3+1} \times C_b^{(L-2)/3}$ | L∈{2,5,8} | a=11, b=1 |
Glue sounds (GS) with vowels | $(C \times V_a)^{(L-2)/3+1} \times V_b^{(L-2)/3}$ | L∈{2,5,8} | a=4, b=1 |
Methods marked with a star (*) are self-terminating.
You can also try these calculations yourself using the SSM calculator.
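As an offline alternative, some of the formulas can be sketched in Python. The snippet below is not the SSM calculator itself, only a rough illustration. The function names are invented for this example, and it covers only VB, CB, GT with consonants, DSSM, and PHM, using the parameters from the final comparison.

```python
def vb(C, V, L):
    """Pure vowel boundaries: V x (C x V)^((L-1)/2), odd L only."""
    return V * (C * V) ** ((L - 1) // 2) if L % 2 == 1 else 0


def cb(C, V, L):
    """Pure consonant boundaries: C x (V x C)^((L-1)/2), odd L >= 3."""
    return C * (V * C) ** ((L - 1) // 2) if L % 2 == 1 and L >= 3 else 0


def gt_consonants(Ca, Cb, V, L):
    """Global termination with consonants: (Ca x V)^((L-1)/2) x Cb."""
    return (Ca * V) ** ((L - 1) // 2) * Cb if L % 2 == 1 and L >= 3 else 0


def dssm(C, V, a, b, L):
    """DSSM: Va words at L=1, Vb-scaled at L=3, unscaled beyond."""
    if L < 1 or L % 2 == 0:
        return 0
    if L == 1:
        return a
    body = (C * V) ** ((L - 1) // 2)
    return b * body if L == 3 else body


def phm(C, V, a, b, L):
    """Pragmatic head marking: CVa words at L=2, floor(CVb/3)-scaled beyond."""
    if L < 2 or L % 2 == 1:
        return 0
    if L == 2:
        return a
    return (b // 3) * (C * V) ** ((L - 2) // 2)


C, V = 12, 5  # example phonetic inventory
print([vb(C, V, L) for L in (1, 3, 5)])
print([cb(C, V, L) for L in (3, 5)])
print([gt_consonants(6, 6, V, L) for L in (3, 5)])
print([dssm(C, V, 1, 2, L) for L in (1, 3, 5)])
print([phm(C, V, 57, 3, L) for L in (2, 4, 6)])
```

Returning 0 for lengths a method cannot produce keeps the functions easy to combine, for example when summing over a range of lengths.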
A note on alternatives to SSM
Before moving on to the final discussion, I would like to highlight that a strict SSM is not the only possible method to provide hints to the listener (or a computer) where words end:
A possible alternative to SSM is implementing vowel harmony (or consonant harmony) in a language. In short, vowel harmony involves dividing the vowel inventory into different classes, and allowing only vowels of the same class in the same word. Depending on how frequently vowels of each class occur, this provides a sort of "soft" or "fuzzy" self-segregating morphology which works for roughly 50 percent of word pairs:
If the vowel class changes between two syllables, we can be sure that they belong to different words and that there is a word boundary between them. However, if the vowels of two syllables belong to the same class, they could either actually belong to the same word, or already to the next word that coincidentally uses vowels from the same class as the previous word.
In theory, such harmony could be implemented as "syllable harmony" as well – allowing only syllables from the same class in the same word. This is already more complex than a simple vowel harmony, so for practical purposes it would probably be easier to use one of the simple "real" self-segregating morphology methods instead.
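The boundary rule for vowel harmony described above can be sketched as follows. The two vowel classes, the syllables, and the function names are hypothetical and chosen only for illustration. The second function estimates how often a boundary is detectable, under the simplifying assumption that the classes of adjacent words are independent.

```python
# Hypothetical vowel classes, for illustration only.
FRONT = set("ei")
BACK = set("aou")


def vowel_class(syllable):
    """Return the harmony class of the first vowel found in a syllable."""
    for ch in syllable:
        if ch in FRONT:
            return "front"
        if ch in BACK:
            return "back"
    raise ValueError(f"no known vowel in {syllable!r}")


def certain_boundaries(syllables):
    """Indexes i where a word boundary must lie between syllable i and i+1.

    A class change guarantees a boundary; no change proves nothing.
    """
    classes = [vowel_class(s) for s in syllables]
    return [i for i in range(len(classes) - 1) if classes[i] != classes[i + 1]]


def detectable_share(class_frequencies):
    """Share of adjacent word pairs whose classes differ.

    Assumes each word's class is drawn independently with the given
    relative frequencies (which must sum to 1).
    """
    return 1 - sum(p * p for p in class_frequencies)


print(certain_boundaries(["ki", "te", "mo", "ka", "li"]))
print(detectable_share([0.5, 0.5]))
```

With two equally frequent classes the estimate comes out at one half, which is where the rough figure of 50 percent comes from. Uneven class frequencies lower it.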
Conclusion
The most interesting part about this comparison is the fact that all methods, despite following the same goal, yield very different results (leaving aside the obvious symmetry between some methods). The good news is that basically all methods can be used to generate a useful number of words, if they are implemented with a phonetic inventory of at least medium size and with reasonable parameters.
As a rule of thumb:
- Choose a simple method like Pure consonant boundaries (CB) or Pure vowel boundaries (VB). The simple methods are often as efficient as or even more efficient than "clever" methods that mark the length of a word in its head syllable. Incidentally, the simple methods are well-suited for agglutinative languages.
- If your language is intended to be processed by computers, consider using a self-terminating morphology method, as this will make statements even less ambiguous. Good choices are Terminal marking (TM) for synthetic languages that inflect the last syllable of words, or Global termination (GT) with consonants for rather analytic languages.
With all the calculations provided in the context of this article, keep in mind that the mere mathematical efficiency of a given SSM method is only a subset of the features worth considering. When choosing an SSM method, look at the whole picture:
- Do I work with a fixed phonetic inventory or can I tweak it?
- Realistically, what's the minimum number of words of a given length that I will need?
- Will the SSM method chosen result in easily pronounceable words?
- Does the SSM method match the aesthetics I want for the language?
- Will the SSM method work with the rest of the morphology (inflection, agglutination)?
- Do I need a self-terminating SSM method?
While it is possible to tweak SSM methods further, this is not necessary to obtain a few thousand reasonably short words. Don't let this stop you from experimenting with new methods of SSM though. Happy conlanging!
Copyright © 2021 by Thomas Heller [ˈtoːmas ˈhɛlɐ]