1 Overview

These plots and searchable, sortable DataTables are designed to accompany the main text. They are divided into unigram, bigram, and trigram sections. The unigrams section considers the occurrences of segments by syllabic position. The bigram and trigram sections consider (not necessarily sequential) co-occurrences of two or three segments, respectively.

For some n-grams, data are plotted as histograms or heatmaps. These are hidden by default, but can be revealed (and hidden again) by clicking the appropriate button.

In all tables, the data can be copied to the clipboard or saved as a CSV file using the buttons provided. There are also additional columns that are hidden by default; these can be revealed using the Show all button, or individual columns can be selected using the Column visibility button. Columns can also be reordered by clicking and dragging. The tables can be sorted in ascending or descending order by clicking the individual column headers; the default is descending order by O/E ratio.

Most columns in the unigram, bigram, and trigram sections have the same interpretation:

- The SV and NSV columns give the number of occurrences (counts) of an n-gram in a given layer.
- The SV/total column shows the relative frequency (expressed as a percentage) of the n-grams that occur in the SV layer.
- The O/E column gives the observed/expected ratio for this n-gram, where the expectation is based on frequencies of occurrence in the two layers:

\[\frac{\text{Count in SV list}}{\text{Count in both lists}} \times \frac{\text{Length of lexicon}}{\text{Length of SV list}}\]

Since the lengths of the lexicon and the SV list are both constant, this means that O/E is linearly proportional to the percentage of occurrences in the SV layer (here, SV/total \(\times\) 0.04234).

The advantage of the O/E ratio lies in its interpretability: when O/E \(\approx\) 1, then the n-gram occurs in the SV list about as often as expected, i.e. about 24% of the time; when O/E < 1, the n-gram occurs less often than expected; and when O/E > 1, it occurs more often than expected.
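As a minimal sketch of the calculation described above, the following Python functions compute O/E from raw counts and from the SV/total percentage. The function and parameter names are illustrative assumptions rather than part of these materials, and the counts in the example are invented.

```python
def oe_ratio(count_sv, count_nsv, len_sv, len_lexicon):
    """Observed/expected ratio for one n-gram (all names are hypothetical).

    count_sv, count_nsv: occurrences of the n-gram in the SV and NSV lists.
    len_sv, len_lexicon: sizes of the SV list and of the whole lexicon.
    """
    share_in_sv = count_sv / (count_sv + count_nsv)  # SV/total as a proportion
    expected_share = len_sv / len_lexicon            # share of the lexicon that is SV
    return share_in_sv / expected_share


def oe_from_percentage(sv_total_pct, factor=0.04234):
    """Shortcut using the constant quoted in the text (SV/total given in percent)."""
    return sv_total_pct * factor


# Invented counts: 24 of 100 occurrences are SV, and the SV list is 24% of the lexicon.
example = oe_ratio(count_sv=24, count_nsv=76, len_sv=240, len_lexicon=1000)
```

With these invented numbers the n-gram occurs in the SV list exactly as often as expected, so `example` comes out at 1.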

Information on columns specific to a particular section is provided below.

2 Unigrams

The unigram tables contain two hidden columns %SV and %NSV, which are the list-specific positional frequencies of each segment. They are not true unigram frequencies. For example, in the Onsets table, /k/ has a %SV of 8.72%; this means that 8.72% of syllables in the SV list begin with /k/. We treat the onset as obligatory, e.g. oan is treated as having the onset /ʔ/. The %SV and %NSV columns are hidden by default; use the Show all or Column visibility buttons to reveal them.
The histograms plot each segment’s percentage of the total segments of that type in each layer, ordered by %SV. The histograms are also hidden by default.
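The list-specific positional frequencies can be sketched as follows. This is an illustration under stated assumptions: the function name and the toy onset list are hypothetical, and each syllable is assumed to contribute exactly one onset, with /ʔ/ supplied for vowel-initial syllables as described above.

```python
from collections import Counter


def positional_percentages(segments):
    """Percentage of syllables in one list whose slot is filled by each segment."""
    counts = Counter(segments)
    total = sum(counts.values())
    return {segment: 100 * n / total for segment, n in counts.items()}


# Hypothetical onsets, one per syllable; "ʔ" stands in for an orthographically empty onset.
toy_onsets = ["k", "k", "t", "ʔ", "m", "k", "t", "ʔ"]
pct = positional_percentages(toy_onsets)
```

Here `pct["k"]` is the share of toy syllables that begin with /k/, which is the quantity reported in the %SV or %NSV column for the corresponding list.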

2.1 Onsets

2.2 Medials

2.3 Nuclei

2.4 Codas

2.5 Tones

3 Bigrams

The bigram tables contain two columns PMI_SV and PMI_NSV, which give the pointwise mutual information (PMI) scores for the segment pair in the relevant list. This statistic tells us something about the status of a sequence of sounds within a given layer. PMI is a measure of how often two events co-occur, compared with how often we would expect them to occur independently:

\[\text{PMI}(x,y)=\log_2\frac{P(x,y)}{P(x)P(y)}\]

These probabilities are estimated from counts, relative to a particular layer (SV or NSV). For example, if we are interested in the sequence where \(x\) is the onset /m/ and \(y\) is the nucleus /aː/, then \(P(x)\) is the number of occurrences of /m/ divided by the total number of unigrams in the list (NSV or SV), \(P(y)\) is the number of occurrences of /aː/ divided by the number of unigrams in the list, and \(P(x,y)\) is the number of occurrences of the sequence /maː/ divided by the total number of bigram sequences that begin with /m/, e.g. /mi/, /me/, /mə/, etc.
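The PMI formula itself can be written as a short Python function. This is only a sketch: the function name is hypothetical, the probabilities in the example are invented, and how each probability is estimated from counts (in particular, which denominator is used) is left to the caller.

```python
import math


def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information in bits, from already-estimated probabilities."""
    if p_xy == 0:
        return float("-inf")  # a pair that never occurs has unboundedly negative PMI
    return math.log2(p_xy / (p_x * p_y))


# Invented probabilities for illustration only.
more_than_chance = pmi(p_xy=0.10, p_x=0.20, p_y=0.25)   # pair is twice as likely as chance
less_than_chance = pmi(p_xy=0.025, p_x=0.20, p_y=0.25)  # pair is half as likely as chance
```

In the first case the pair is twice as probable as independence would predict, giving a PMI of about 1 bit; in the second it is half as probable, giving about -1 bit.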

PMI is defined only for pairs of events, not for single events. Intuitively, we may think of it as a measure of surprisal: if the onset is m, how surprised are we if an a follows it? If this sequence is common, then PMI will be positive, and our surprisal will be low; when PMI is negative, then the sequence is uncommon, and our surprisal will be higher. Sequences which are common in the SV list but uncommon in the NSV list should therefore have high (positive) PMI values when the SV layer is considered, and low (negative) values when the NSV list is considered.

PMI values range from negative to positive infinity, but negative values are especially unreliable when counts are small. For this reason, in many applications, only PMI values \(\geq\) 0 are considered. However, when there are sufficient observations, negative PMI can still be a useful indicator that a particular segment sequence is uncommon. For more on PMI in a linguistic context, see Goldsmith 2002; Goldsmith & Riggle 2012; Hall et al. 2017.

In the tables, PMI is colored green when it exceeds 0.25 and red when it is less than -0.25, but there is nothing inherently special about these values. These columns are hidden by default; use the Show all or Column visibility buttons to reveal them.
Again, all segments are treated as “positionally specific”. That is, final -k and onset k- are not the same k for purposes of determining frequencies (and therefore pointwise mutual information). This is partly because what we are interested in is the positional stickiness, and partly because they are arguably different (phonetic) segments.
The heatmaps indicate the number of occurrences of a bigram in a given layer. Hover over a cell in the heatmaps to see the exact count of bigrams for that cell. In the heatmaps only, bigrams with \(n=1\) are not shown.
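One way to keep segments positionally specific when counting bigrams is to tag each segment with its syllable position, as in the sketch below. The record layout `(onset, nucleus, coda)`, the function name, and the toy syllables are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical syllable records: (onset, nucleus, coda); "" marks the null coda.
toy_syllables = [
    ("k", "a", "k"),
    ("k", "a", "n"),
    ("m", "a", "k"),
    ("k", "a", "k"),
    ("t", "i", ""),
]

POSITIONS = {"onset": 0, "nucleus": 1, "coda": 2}


def bigram_counts(syllables, first, second):
    """Count co-occurrences of two slots, keeping the position tag on each segment."""
    i, j = POSITIONS[first], POSITIONS[second]
    return Counter(((first, s[i]), (second, s[j])) for s in syllables)


onset_coda = bigram_counts(toy_syllables, "onset", "coda")

# Mirror the heatmap convention described above: drop bigrams that occur only once.
heatmap_cells = {pair: n for pair, n in onset_coda.items() if n > 1}
```

Because every segment carries its position, `("onset", "k")` and `("coda", "k")` are distinct keys and are never pooled when frequencies are computed.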

3.1 Onset-nucleus

3.2 Onset-medial

3.3 Onset-coda

3.4 Onset-tone

3.5 Nucleus-coda

3.6 Coda-tone

4 Trigrams

4.1 Onset, medial, nucleus

4.2 Onset, medial, coda

4.3 Medial, nucleus, coda

4.4 Nucleus, coda, tone

5 Syllable structure

possible is the count of possible syllables of this shape. What counts as a “possible” syllable? There are different ways to define this; here we assume:
- 24 onsets /ɓ ɗ t tʰ ʈ c k f v s z ʂ ʑ x ɣ h l r m n ɲ ŋ j ʔ/ (we distinguish orthographic d and gi, in addition to s and x)
- 12 nuclei /aː e əː ɛ i ɨ ɔ o u iə ɨə uə/ with unrestricted distribution
- 2 nuclei /a ə/ that cannot occur in open syllables
- 1 glide /w/ which may not be preceded by /ɓ f v ʑ m n j/ or followed by /ɨ ɔ o u ɨə uə/ (ostensibly the single exception is quốc, but it is typically pronounced /kwək/)
- 3 nasal codas /-m -n -ŋ/ and 3 unreleased plosive codas /-p -t -k/
- 2 semivowel codas /-w -j/ with restricted distribution: /-j/ cannot follow /i iə e ɛ/ and /-w/ cannot follow /əː ɔ o u uə/
- 1 “null” coda that can only follow the 12 nuclei with unrestricted distribution
- 6 tones that can occur with the sonorant or null codas
- 2 tones that can occur with obstruent codas
SV and NSV are the counts of syllables of these shapes in the SV and NSV lists, respectively.
%SV and %NSV are the percentages of the possible number of syllables of this shape that occur in the SV or NSV lists, respectively.
%shape is the sum of %SV and %NSV in a given row.
%attested is the sum of %SV and %NSV divided by the sums of the SV and NSV columns.
%possible is the sum of %SV and %NSV divided by the sum of the possible column (17,526).
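The inventory assumptions listed above for the possible column can be checked by brute-force enumeration. The sketch below is not the authors' code; the variable and function names are hypothetical, and it encodes only the restrictions stated in the list (no further phonotactic constraints).

```python
from itertools import product

ONSETS = "ɓ ɗ t tʰ ʈ c k f v s z ʂ ʑ x ɣ h l r m n ɲ ŋ j ʔ".split()
FREE_NUCLEI = "aː e əː ɛ i ɨ ɔ o u iə ɨə uə".split()  # unrestricted distribution
CLOSED_ONLY_NUCLEI = ["a", "ə"]                        # barred from open syllables
MEDIALS = ["", "w"]                                    # "" means no glide
STOP_CODAS = ["p", "t", "k"]
CODAS = ["", "m", "n", "ŋ", "w", "j"] + STOP_CODAS     # "" is the null coda

NO_GLIDE_AFTER = set("ɓ f v ʑ m n j".split())   # onsets that cannot precede /w/
NO_GLIDE_BEFORE = set("ɨ ɔ o u ɨə uə".split())  # nuclei that cannot follow /w/
NO_J_AFTER = set("i iə e ɛ".split())            # nuclei that cannot precede /-j/
NO_W_AFTER = set("əː ɔ o u uə".split())         # nuclei that cannot precede /-w/


def count_possible_syllables():
    """Count onset-medial-nucleus-coda-tone combinations allowed by the assumptions."""
    total = 0
    nuclei = FREE_NUCLEI + CLOSED_ONLY_NUCLEI
    for onset, medial, nucleus, coda in product(ONSETS, MEDIALS, nuclei, CODAS):
        if medial == "w" and (onset in NO_GLIDE_AFTER or nucleus in NO_GLIDE_BEFORE):
            continue
        if coda == "" and nucleus in CLOSED_ONLY_NUCLEI:
            continue
        if coda == "j" and nucleus in NO_J_AFTER:
            continue
        if coda == "w" and nucleus in NO_W_AFTER:
            continue
        # 2 tones with obstruent codas, 6 tones with sonorant or null codas
        total += 2 if coda in STOP_CODAS else 6
    return total


n_possible = count_possible_syllables()
```

Under these assumptions the enumeration should reproduce the total of 17,526 given for the possible column: 12,528 syllables without a medial glide and 4,998 with one.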

5.1 Possible and attested syllables

Takeaways:

Out of about 17,500 possible syllables, roughly half are attested, and of that half, about 25% are SV.
The distribution of attested syllables relative to possible syllables is extremely uneven. For instance, out of all possible CV sequences (including tones), nearly 80% are attested, while only about half of all possible CVN sequences are. However, syllables with a CVN shape account for 38% of all attested syllables.
Only around 10% of attested syllables contain a medial glide.

5.2 Canonical syllable shape

Trần & Vallée 2009 report that “the prevalent monosyllabic pattern in Vietnamese…was the CVC syllable type, respectively 70% and 34% of the monosyllabic words, and respectively 70% and 20% of the language syllable inventory” (2009:232). Their counts were derived from a list of words with frequency above 2% in a 5,000-word lexicon. If we collapse the above table into their three categories (CV, CVC, CCVC), we see that the numbers are quite close: about 21% C(C)V, 71% CVC, and 8% CCVC.

Exploring statistical regularities in the syllable canon of Sino-Vietnamese loanmorph phonology (supplementary materials)

James Kirby & Mark Alves

04 June 2021