Tokens

A token is an occurrence of a word.

For example, the sentence "A token is an occurrence of a word" contains one token of the word 'occurrence', two tokens of the word 'a', and a total of 8 tokens.

Types

A type is a word.

For example, the sentence "A token is an occurrence of a word" contains 7 types, namely, 'a', 'token', 'is', 'an', 'occurrence', 'of', and 'word'.
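The two counts above can be reproduced with a minimal sketch. It assumes that tokens are separated by whitespace and compared case-insensitively; the helper name `tokenize` is hypothetical, and the profiler's actual tokenizer may differ.

```python
from collections import Counter

def tokenize(text):
    # Assumption: whitespace-separated tokens, compared case-insensitively.
    return text.lower().split()

sentence = "A token is an occurrence of a word"
tokens = tokenize(sentence)
counts = Counter(tokens)      # maps each type to its number of tokens

print(len(tokens))            # total tokens: 8
print(len(counts))            # total types: 7
print(counts["a"])            # tokens of the type 'a': 2
```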

Diversity

Diversity is the type/token ratio of a text: the number of types divided by the number of tokens. Readings with a higher diversity contain a larger proportion of types (unique words) relative to their length.

The maximum value of diversity is 1. In this situation, the number of types equals the number of tokens, which means that every word in the text occurs exactly once. An example of a text with diversity 1 is "An example of a text with diversity 1".

The minimum value of diversity approaches 0. In this situation, all tokens in the text correspond to a single type, so diversity equals 1 divided by the number of tokens and shrinks as the text grows. An example of a text with low diversity is "Example example example example example example", whose diversity is 1/6.
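As a rough illustration of the two extremes, the sketch below computes the type/token ratio under the assumption of whitespace-separated, case-insensitive tokens. The function name `diversity` is chosen here for illustration only.

```python
def diversity(text):
    """Type/token ratio, assuming whitespace-separated, case-insensitive tokens."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("diversity is undefined for an empty text")
    return len(set(tokens)) / len(tokens)

all_unique = diversity("An example of a text with diversity 1")
one_type_short = diversity("Example " * 6)      # one type, 6 tokens
one_type_long = diversity("Example " * 1000)    # one type, 1,000 tokens

print(all_unique)        # 1.0
print(one_type_short)    # 1/6
print(one_type_long)     # 0.001
```

The longer the single-type text, the closer the ratio gets to 0, although it never reaches it.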

Dispersion

Dispersion is a measure based on the "nonbiased" or "n-1" (sample) standard deviation formula. It illustrates how uniformly a reading's vocabulary is distributed across the 14 BNC sublists. The dispersion value is presented on a 0 to 10 scale.

A dispersion value of 10 implies that a reading uses an equal number of families, types, or tokens from each of the 14 BNC sublists.

A dispersion value of 0 implies that a reading uses families, types, or tokens from only one of the 14 BNC sublists.

A word of caution. Standard deviation does not take into consideration the different frequency "weight" that each of the 14 BNC sublists has. In other words, two readings can have the same type dispersion value even though one of the readings uses types from only the 1st sublist while the other reading uses types from only the 14th sublist.
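The exact formula is not given here, so the following is only one possible sketch that matches the description. It assumes the score is 10 × (1 − s / s_max), where s is the "n-1" standard deviation of the per-sublist counts and s_max is the standard deviation obtained when a single sublist holds every item. The function name `dispersion` and this normalization are assumptions, not the profiler's documented method.

```python
import math
import statistics

def dispersion(counts):
    """Hypothetical 0-10 dispersion score for per-sublist counts."""
    n = len(counts)
    total = sum(counts)
    if n < 2 or total == 0:
        raise ValueError("need at least two sublists and at least one item")
    s = statistics.stdev(counts)     # "n-1" (sample) standard deviation
    s_max = total / math.sqrt(n)     # stdev when one sublist holds everything
    # Clamp at 0 to absorb floating-point rounding.
    return max(0.0, 10 * (1 - s / s_max))

even = [5] * 14                  # equal counts in all 14 sublists
first_only = [70] + [0] * 13     # everything in the 1st sublist
last_only = [0] * 13 + [70]      # everything in the 14th sublist

print(dispersion(even))          # 10.0
print(dispersion(first_only))    # approximately 0
print(dispersion(last_only))     # approximately 0
```

The last two calls illustrate the caveat: the score depends only on how the counts are spread, not on which sublists they fall in.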

Stop list

The stop list contains those types in a reading that have been excluded from the vocabulary profiling. Examples of types placed in the stop list are proper nouns, acronyms, and foreign words.

Keep in mind that a reading's token and type counts include the stop list, as do average types per sentence, average tokens per sentence, and diversity. Coverage measures, on the other hand, do not include the stop list. For example, the combined coverage of a reading provided by the GSL and AWL may be 100% even though the reading contains a number of proper nouns or acronyms that appear in neither list.
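The difference between the two kinds of measure can be sketched as follows. The function `token_coverage`, the toy sentence, and the toy word list are all hypothetical; the word list merely stands in for the combined GSL and AWL types.

```python
def token_coverage(tokens, word_list, stop_list):
    """Percentage of profiled tokens (stop list excluded) found in word_list."""
    profiled = [t for t in tokens if t not in stop_list]
    if not profiled:
        return 0.0
    covered = sum(1 for t in profiled if t in word_list)
    return 100 * covered / len(profiled)

tokens = "nasa sent a letter to london".split()
stop_list = {"nasa", "london"}              # an acronym and a proper noun
word_list = {"sent", "a", "letter", "to"}   # toy stand-in for GSL + AWL types

total_tokens = len(tokens)                  # the token count keeps the stop list
coverage = token_coverage(tokens, word_list, stop_list)

print(total_tokens)   # 6
print(coverage)       # 100.0
```

The token count remains 6, while coverage is computed over the 4 tokens left after the stop list is removed.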

Coverage

Coverage is always expressed as the percentage of a given number of tokens, types, or families in relation to the total tokens, types, or families found in the reading.

As mentioned in the stop list section, coverage calculations do not use the total tokens and types in a reading. In addition to subtracting the tokens and types in the stop list from the respective totals, non-words are also eliminated. A non-word is a digit (or a number expressed in digits), a single letter (with the exception of 'a' and 'I'), or any string of characters that contains non-alphabetic characters of any kind. Contractions are resolved by eliminating the trailing substring following the apostrophe, with the exception of "can't", which is expanded to "cannot".
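The rules above might be sketched as a single normalization step. This is an illustration under stated assumptions: the function name `normalize` is hypothetical, contractions are resolved before the non-word check, the apostrophe itself is dropped along with what follows it, and only the straight apostrophe (') is handled.

```python
def normalize(token):
    """Return the form used for coverage, or None if the token is a non-word."""
    if token.lower() == "can't":
        return "cannot"                       # the one contraction that is expanded
    # Drop the apostrophe and everything after it ("it's" -> "it").
    word = token.split("'", 1)[0].lower()
    # str.isalpha() is False for empty strings and for anything containing
    # digits, hyphens, or punctuation (it does accept non-ASCII letters).
    if not word.isalpha():
        return None
    if len(word) == 1 and word not in {"a", "i"}:
        return None                           # single letters other than 'a' and 'I'
    return word

print(normalize("can't"))        # cannot
print(normalize("it's"))         # it
print(normalize("I"))            # i
print(normalize("2024"))         # None
print(normalize("b"))            # None
print(normalize("well-known"))   # None
```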

GSL

The General Service List (GSL) by West (A General Service List of English Words, 1953, London: Longmans, Green & Co.) is composed of approximately 2,000 word families.

The version of the GSL used for the analyses can be found here. Another version of the GSL, perhaps more adequate for instruction, can be found here.

List             Families    Types
GSL 1st 1,000         998    4,119
GSL 2nd 1,000         988    3,708
                  -------  -------
Total               1,986    7,827

AWL

The Academic Word List (AWL) by Coxhead (A New Academic Word List. TESOL Quarterly, 34(2), 2000: 213-238) contains 570 word families specific to academic texts (and not found in the GSL).

The version of the AWL used for the analyses can be found here. Another version of the AWL, perhaps more adequate for instruction, can be found here.

List             Families    Types
AWL                   570    3,107

BNC

We use the acronym BNC (British National Corpus) to refer to the 14,000 most frequent word families in the English language as compiled by Paul Nation. The BNC list is subdivided into 14 sublists, each with 1,000 families.

The version of the list used for the analyses can be found here.

List             Families    Types
1 - 1,000           1,000    6,348
1,001 - 2,000       1,000    5,593
2,001 - 3,000       1,000    4,517
3,001 - 4,000       1,000    4,287
4,001 - 5,000       1,000    3,992
5,001 - 6,000       1,000    3,494
6,001 - 7,000       1,000    3,272
7,001 - 8,000       1,000    3,192
8,001 - 9,000       1,000    3,050
9,001 - 10,000      1,000    2,840
10,001 - 11,000     1,000    2,794
11,001 - 12,000     1,000    2,568
12,001 - 13,000     1,000    2,426
13,001 - 14,000     1,000    2,225
                  -------  -------
Total              14,000   50,598
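
As a quick check of the table above, the following sketch recomputes the totals and maps a family's frequency rank to its sublist. The type counts are copied from the table; the helper `sublist_of` is hypothetical and simply assumes that families are ranked from 1 (most frequent) to 14,000.

```python
# Types per sublist, in order from the 1st to the 14th sublist (from the table).
bnc_types = [6348, 5593, 4517, 4287, 3992, 3494, 3272,
             3192, 3050, 2840, 2794, 2568, 2426, 2225]
FAMILIES_PER_SUBLIST = 1000

total_families = FAMILIES_PER_SUBLIST * len(bnc_types)
total_types = sum(bnc_types)

def sublist_of(rank):
    """Sublist number (1-14) for a family with the given frequency rank."""
    if not 1 <= rank <= total_families:
        raise ValueError("rank must be between 1 and 14,000")
    return (rank - 1) // FAMILIES_PER_SUBLIST + 1

print(total_families)     # 14000
print(total_types)        # 50598
print(sublist_of(1001))   # 2
```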