Next-Word Prediction
I open a new chat at chatgpt.com. I send the prompt “Which word is most likely to come next?” GPT promptly responds, “Sure! Please provide the sentence or sequence of words you want me to complete.” (See the full chat here.) What’s going on when I have this interaction with GPT? How does it know what to say to me?
You may have heard someone sum up how GPT works along the following lines: “It predicts the next word in a sequence, that sequence being the text input the LLM receives from the user.” As a sketch of the mechanism, this is serviceable. But it doesn’t capture how GPT differs from a simple $n$-gram model (not that I’m presently able to articulate that difference adequately myself; I just know transformers like GPT are much more advanced, complex technologies than $n$-grams). The explanation is rather confusing in certain respects as well. And no, I’m not referring primarily to the use of the word “word” instead of the technically correct term “token.” Here’s what I mean.
Where does a sequence of words begin and end? You might think a sequence of words begins when a user starts typing a prompt and ends when they hit send. But obviously GPT doesn’t predict that there is no next word in the sequence just because the user has finished typing the message (and it expresses a complete thought). Or, if GPT does predict that, that predicted absence of a word differs markedly from GPT’s typical, notoriously lengthy responses.
A natural thought is, the sequence whose next word GPT predicts is the conversation the user initiates, not a single message. So the sequence is not forever limited to the prompt GPT is currently answering; it will include all the prompts and answers there ever will be in the current chat. But if the sequence is that exact conversation, something seems to be missing from our account of how GPT works. For if GPT possessed autonomy relevantly analogous to that of a normal interlocutor, GPT could make its job easy by opting out of continuing the conversation, and, if necessary, preventing the user from sending further messages in the chat. That way GPT could always predict no word will come after the prompt. We might expect GPT to sometimes predict that anyway, because, in human interactions, it’s normal, in the right context, not to respond to certain sequences of words. So why does GPT invariably predict words will come after the prompt?1
I suspect one reason GPT does not end the conversation of its own accord is because it’s trained to say the sort of thing that tends to be said in a conversational context similar to the one it’s in, not just to say what it expects will be said (or, rather, what it expects to say) in the conversation it’s having right now. So the sequence about which it’s making predictions is a hypothetical, normal conversational sequence comparable to the current chat, not this latter chat itself. But, again, in some conversational contexts or contexts where attempts are made to initiate conversations, people tend to say nothing even if others have shown interest in dialogue with them. So there’s still a question as to why GPT never mimics that undesired silence. I’d venture to say its behavior of responding to every prompt (unless it’s been prompted not to respond, in which case it still seems reluctant, in my experience, to comply) is hard-coded, though of course the form that response takes depends on what GPT has learned from its training data and interactions with users.
Sequences of words occur in a wide variety of contexts. Conversations seem to be the most relevant type of context, since GPT plays the role of an interlocutor. But then again, it’s virtually certain that GPT was trained, in part, on non-conversational linguistic data. If it were simply predicting the next word based on the frequencies with which, in its training data, various words follow the words just input to it, sometimes it would likely do nothing but continue the user’s own train of thought, instead of appropriately responding, imparting knowledge, and offering help. So why does GPT mimic conversational, helpful behavior rather than other behaviors involving stringing words together? For now I’ll let the mystery be. But perhaps the biggest mystery of all, which I’m not content to leave unsolved, is the subject of the next section.
The Curse of Dimensionality
One of the most mathematically and computationally useful ways to understand the probability of a word $w$’s coming next is in terms of the actual frequency with which $w$ follows a given word sequence $s$. That frequency is the number of times $w$ follows $s$ in GPT’s training data, divided by the total number of times $s$ occurs in those data. But for reasons I’ll explain later on in this section, as useful as this straightforward approach to calculating probabilities is, it doesn’t seem good enough for the purposes of language modeling.
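To make the frequency idea concrete, here is a toy calculation; the phrase and the counts are hypothetical, not drawn from any actual corpus. If “the cat” occurs 50 times in the training data and is followed by “sat” in 10 of those occurrences, the estimate is

$$P(\text{“sat”} \mid \text{“the cat”}) = \frac{\text{number of occurrences of “the cat sat”}}{\text{number of occurrences of “the cat”}} = \frac{10}{50} = 0.2.$$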
Let $T$ be the training corpus of our language model (henceforth “LM”), $L$ a natural language like English (as opposed to a formal language like Java, Python, or first-order logic), $V$ the vocabulary of $L$ (i.e., the set of distinct words2 in $L$), and $S$ the set of all sequences consisting solely of words in $V$. To be clear, neither $T$ nor $L$ is meant to be a set like $V$ or $S$, even though we shall use the locution “$s$ in $T$” to speak of some datum in our LM’s training sample. For all $s \in S$ and $w \in V$, let the sequence $sw$ be $s$ followed by $w$. I shall call a subsequence $s'$ of a sequence $s$ an “unbroken subsequence” if and only if any two consecutive terms of $s'$ are consecutive terms of $s$ (lying between the positions in $s$ corresponding to $s'$’s endpoints). Assume, unrealistically, that there’s a proportionality constant $k$ such that, for every word $w$ and every sequence $s$ in $S$, how often $w$ has followed $s$ in the use of $L$ in recent years exactly equals $k$ times how often $w$ follows $s$ in $T$. Let $w_1$ and $w_2$ be words in $V$. These assumptions entail the following. If, in the contemporary use of $L$, $sw_1$ is more common than $sw_2$, it’s more common in $T$ as well. Relatedly, it follows that the ratio between the two words’ frequencies is the same in both $T$ and the contemporary use of $L$. So the training data accurately represent the relative probabilities of $w_1$ and $w_2$, conditional on $s$’s having come before.3 That is, the ratio deducible from the data is the actual ratio, despite the limitations of the training sample. The sample is thus, in large measure, representative.
Even so, the LM may not be prepared to respond to any old prompt a user might enter, in anything like the way your average human speaker of $L$ can respond to what another speaker says. If the LM is to be ready for that, it needs to be disposed to respond appropriately to novel sentences, paragraphs, etc., i.e., sequences in $S$ which either have never occurred in the recent use of $L$ or whose instances are unrepresented in the training data $T$. But how can the LM be so disposed? If, for all $w \in V$ and $s \in S$, the LM equates $P(w \mid s)$ with

$$\frac{C(sw)}{C(s)},$$

where $C(x)$ is the number of occurrences of $x$ in $T$, then the LM will conclude $P(w \mid s)$ is undefined if $s$ never shows up in $T$. So the model won’t be able to assign a higher probability to any possible next word than to any other. In other words, the LM won’t be able to guess what the next word is, much less make a good prediction; all continuations of $s$ will effectively be equally likely from the LM’s standpoint. That’s a problem. More to the point, it’s a problem that we know the developers of LMs have already solved, considering how reliably the best LMs today give proper responses to our queries. The question is how the developers accomplished this, if LMs truly are just complicated next-word predictors.
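Here is a minimal sketch, in Python, of the kind of count-based predictor I have in mind; the tiny corpus and the helper names are my own inventions, not anything GPT actually does:

```python
from collections import Counter, defaultdict

# Toy "training corpus": a handful of word sequences.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the dog sat on the mat",
]

# Count how often each word follows each context (here, the preceding two words).
follower_counts = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for i in range(2, len(words)):
        context = tuple(words[i - 2:i])
        follower_counts[context][words[i]] += 1

def next_word_probability(context, word):
    """Relative frequency of `word` after `context` in the corpus; None if the context is unseen."""
    counts = follower_counts.get(tuple(context))
    if counts is None:
        return None  # the context never occurs, so the estimate is undefined
    return counts[word] / sum(counts.values())

print(next_word_probability(["sat", "on"], "the"))       # 1.0
print(next_word_probability(["the", "cat"], "sat"))      # 1.0
print(next_word_probability(["the", "hamster"], "sat"))  # None: unseen context
```

The last call is the whole problem in miniature: the model has no basis for ranking any continuation of a context it has never seen.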
Admittedly, a number of authors influential in the AI literature seem to have a subtly different notion than I do of the probabilistic formula $n$-grams are based on. If you understand the probabilities like they do, you might not see the issue yet. (As I shall discuss momentarily, some of these authors have noticed how the issue I raise crops up, in a slightly different form, even in their statistical framework.) Let’s say $w_1 \cdots w_n$ is a sequence of $n$ words. These researchers talk as though $n$-gram LMs equate the probability of the $n$-gram $w_1 \cdots w_n$ with

$$\frac{C(w_1 \cdots w_n)}{N},$$

where $C(w_1 \cdots w_n)$ is the number of occurrences of $w_1 \cdots w_n$ in $T$ and $N$ is the total number of $n$-gram occurrences in $T$.

Of course, if $w_1 \cdots w_n$ never occurs in $T$, but some sequence of $n$ words does, then the denominator here is nonzero. But the numerator is zero; it’s also zero when $w_1 \cdots w_{n-1}$ occurs in $T$ but $w_1 \cdots w_n$ doesn’t. So we end up estimating that the probability is zero, not undefined. But that’s still a problem, one which Bengio et al. dub “the curse of dimensionality” in their 2003 article “A Neural Probabilistic Language Model.” The paper is largely above my pay grade. Nonetheless, I’ll share a quote from p. 1138, in which they touch on the matter at hand:
$n$-gram models construct tables of conditional probabilities for the next word, for each one of a large number of contexts, i.e. combinations of the last $n-1$ words. What happens when a new combination of $n$ words appears that was not seen in the training corpus? We do not want to assign zero probability to such cases, because such new combinations are likely to occur, and they will occur even more frequently for larger context sizes.
In “A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams,” Church and Gale describe the curse of dimensionality as “[t]he failure of the maximum likelihood estimator (MLE)” (p. 21). They expound on pp. 21–22:
Suppose that a particular pair of words occurred $r$ times in a sample of $N$ pairs of the language. The obvious estimator of the probability of the pair is $r/N$. The problem is that most possible pairs will not occur in a given sample. Therefore, most of the possible pairs have an observed frequency $r = 0$. However, as our example shows, we wish to take products and ratios of the estimated probabilities, and it will often happen that neither of two alternative bigrams has been seen in the training text. In this case, the MLE would not determine a likelihood ratio, as both probabilities would be estimated at zero.
Something I’d like to mention, in passing: At first blush, it strikes me as misguided to use, in a next-word prediction algorithm, the formula Bengio et al., Church, and Gale appear to say $n$-gram models depend on. That is, even if that formula and mine are equally cursed dimensionality-wise, the former seems to track the wrong probability distribution altogether. What we should be predicting is the word that makes the most sense in the context, regardless of how rarely that context arises. But if $w_1 \cdots w_{n-1}$ is an unusual context, whereas $n$-grams are ubiquitous, then dividing the number of occurrences of $w_1 \cdots w_n$ by the number of $n$-gram occurrences will yield a vanishingly small probability estimate, even if $w_n$ follows $100\%$ of the times that $w_1 \cdots w_{n-1}$ occurs! Maybe this doesn’t skew the probability distribution, because the number of instances of every $n$-gram is divided by the same total number of $n$-gram instances. Perhaps everything evens out in the final analysis. It just seems fishy to me. It feels like this approach would cause the LM, when faced with an atypical context, to predict the next word based on less unusual subsequences of the unusual context, or even based on the frequencies of words considered individually, apart from context. And a lot of predictively useful contextual information could be ignored in the process.
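A purely illustrative comparison of the two formulas, with invented counts: suppose the context $w_1 \cdots w_{n-1}$ occurs only 3 times in a corpus containing $N = 1{,}000{,}000$ $n$-gram instances, and that $w_n$ follows it all 3 times. Then

$$\frac{C(w_1 \cdots w_n)}{N} = \frac{3}{1{,}000{,}000} = 0.000003, \qquad \frac{C(w_1 \cdots w_n)}{C(w_1 \cdots w_{n-1})} = \frac{3}{3} = 1.$$

The first number is tiny only because the context itself is rare; since every candidate next word shares the same denominator $N$, the ranking of candidates is unaffected, but the conditional formula is the one that directly answers the question “which word comes next here?”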
Anyway, back to the topic of this section. The fundamental difficulty is ensuring our LM can aptly answer queries that don’t match any of the word sequences it’s seen before. Human speakers of $L$ reliably comprehend and properly reply to novel sentences because they grasp exceedingly well the rules governing both the construction of sentences in $L$ and the use of individual words in $L$. But to my knowledge, all LMs have to work with are the frequencies with which words have been used in various contexts in their training data. They cannot infer from these statistical data the syntactic and semantic rules (in their full generality) that speakers of a language follow or how those rules apply in contexts unobserved in training.
You might think it’s so absurdly improbable that anyone would say something an LM like GPT has never heard before that, from a practical standpoint, there’s hardly any point in fighting this so-called “curse.” But a simple mathematical argument can be given that the sentences in $S$ vastly outnumber those in $T$. What’s more, the overwhelming majority of sentences in $S$ are not in $T$. Here’s the argument.
Let our LM be GPT and $L$ be English. A conservative estimate of the number of words in $V$ is $10^5$. By calculating the number of ten-word sequences in $S$, we see that the total number of sequences in $S$ must be at least $(10^5)^{10} = 10^{50}$, an enormous number.4,5 Though OpenAI has not officially revealed the size of GPT-4’s training sample, GPT-3 is estimated to have trained on $570$ GB of text from Common Crawl, and some give estimates of a similar order for the size of GPT-4’s corpus. About half of Common Crawl’s dataset is written in English (which is one reason GPT tends to communicate better in English than in other natural languages). That means English text would make up roughly half of those hundreds of gigabytes of textual training data.
Five letters is the average length of an English word, but for the sake of argument we’ll make the generous simplifying assumption that every English word is a mere four letters long (just long enough that there are over $10^5$ distinct letter sequences/possible words of that length, given a $26$-letter alphabet6). How many ten-word sequences can you fit in $570$ GB, if your $10^5$-word vocabulary consists entirely of four-letter words? Well, at five bytes per word (four letters plus a separating space), the number of words you can fit in $570$ GB is

$$\frac{570 \times 10^9 \text{ bytes}}{5 \text{ bytes per word}} = 1.14 \times 10^{11}.$$
Now, if a single sequence of $1.14 \times 10^{11}$ words takes up $570$ GB, then (i) those $570$ GB can simultaneously contain all the unbroken7 subsequences of that sequence, and (ii) the number of its unbroken subsequences is the maximum number of sequences (of the relevant, unbroken sort) that $570$ GB can contain. Consequently, $570$ GB can contain at most as many ten-word sequences as there are unbroken, ten-word subsequences of a sequence of $1.14 \times 10^{11}$ words. Of course, a sequence of one or more words inevitably contains more words than unbroken, ten-word subsequences. Therefore, seeing as the number of ten-word sequences in $S$ is far larger than the estimated maximum number of words in GPT’s training data, it is likewise far larger than the number of unbroken, ten-word subsequences in those data. Indeed, $10^{50}$ is well over twice as big as either of the two numbers I just compared it to. Thus, far fewer ten-word sequences are in $T$ than are in $S$ but not in $T$.
By the same logic, no matter what sequence length we choose, the sequences of that length in $S$ greatly outnumber those of that length in $T$.
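A quick back-of-the-envelope check of the argument in Python; the vocabulary size and corpus size are the rough assumptions stated above, not official figures:

```python
# Assumptions from the text (rough, and deliberately generous to the LM).
vocab_size = 10**5          # conservative estimate of the English vocabulary
corpus_bytes = 570 * 10**9  # ~570 GB of text, the figure cited for GPT-3's Common Crawl data
bytes_per_word = 5          # four-letter words plus a separating space

possible_ten_word_sequences = vocab_size ** 10          # 10**50
max_words_in_corpus = corpus_bytes // bytes_per_word    # 1.14 * 10**11
# A corpus of m words contains at most m unbroken ten-word subsequences.
max_ten_word_subsequences = max_words_in_corpus

print(possible_ten_word_sequences)                                # 10**50
print(max_ten_word_subsequences)                                  # 114_000_000_000
print(possible_ten_word_sequences // max_ten_word_subsequences)   # ~8.8 * 10**38
```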
Good-Turing Frequency Estimation
So the formulas above aren’t good enough for programming an LM to reply suitably to input that differs, however slightly, from every sample input—every bit of the training data. How can an LM handle contexts of this kind, if not via the MLE or any formula given thus far?
One idea is to attune the LM to how frequently words have various frequencies, not merely to what frequencies words have. This is the idea behind Good-Turing estimation, a technique Good originally proposed in the 1953 paper “The Population Frequencies of Species and the Estimation of Population Parameters.” Allow me to state more precisely the basic procedure, illustrating in broad brush strokes, as I go, how it might be applied in language modeling.
For any frequency $r$ with which some word follows some sequence $s$ in training corpus $T$, we can identify $N_r$, the number of words that each follow that sequence $r$ times in $T$.8,9 We may describe $N_r$ as “the frequency of the frequency $r$.” Once we determine $N_r$ for each $r$, we may proceed to line up all the frequencies of frequencies, forming a series $N_0, N_1, N_2, \ldots$ that will be immensely useful for approximating the probability that a word will follow an unobserved sequence in the use of language $L$ (if and when some user of $L$ forms that sequence out of words in vocabulary $V$). The series enables us to estimate $r^*$, the number of times a given word that follows a sequence $r$ times in $T$ would follow that sequence, were $T$ a maximally representative sample of text written in $L$. Note that $r^*$ is not merely a variable but a variable paired with an operator, $*$. Unlike the familiar multiplication operator, however, this one takes only one argument, namely $r$. The equations below reveal how the $*$ operation works, that is, how it modifies its input to produce an output.
Once we have $r^*$ and the number $N$ of instances in $T$ of a sequence $s$ of words, we can obtain the approximate value of $p_r$, the frequency with which a word that follows $s$ $r$ times in $T$ follows $s$ in the recent use of $L$ (not just in $T$, a mere sample of the contemporary use of $L$). We do so by estimating the expectation of that frequency.
Here’s the gist of the concept of this expectation, as I understand it. If you were to turn back time to the very beginning of the “contemporary” use of $L$, whatever random word choices people had made since that initial time would have to be made again (though the discursive contexts in which some of those choices were made might not arise again, due to changes in prior word choices). They might choose different words that time around, so we might see a change in the ratio of the number of cases of $sw$ to that of cases of $s$. Were you then to iterate this reversal of the final segment of $L$’s history infinitely many times, you could basically just average the values that ratio takes in the infinitely many iterations to get a decent sense of the value it most likely takes in any given iteration. (Analogy: Were you to flip a coin infinitely many times, you’d expect the ratio of heads to tails to approach $1:1$ in the long run. So if you were to guess how frequently a coin would come up heads in some sequence of coin flips, the most reasonable prediction would be half the total number of flips, no matter how many times the coin was to be flipped.) That (weighted) average is known as the expectation of the frequency.
We can take this average to be the probability, given the occurrence of $s$, that the word in question will follow $s$ in the present and near-future use of $L$. And fortunately, with the series $N_0, N_1, N_2, \ldots$ and $N$ at our disposal, we have a way to estimate this average. For Good proved Turing’s formula to the effect that

$$p_r \approx \frac{r^*}{N}, \qquad \text{where } r^* = (r+1)\,\frac{E(N_{r+1})}{E(N_r)},$$

where $E(X)$ denotes the expectation, or expected value, of a random variable such as $N_r$, and where $N$ is the number of occurrences of $s$ in $T$.

Good was interested in estimating how common various species are in a population. In that domain of inquiry, the number of species in the population tends to be unknown, making it challenging to calculate $N_0$ and, by extension, $E(N_0)$. You cannot tell, just by looking at a (limited) sample of the population, how many species in it occur zero times in that sample. To infer how many are unrepresented, you would need to be aware of how many species are in the population, not just in the sample. But (depending on the population) neither the sample nor common knowledge provides you with that information. Population frequencies of species differ from frequencies of words in that respect. That is, we do know roughly how many words there are in our language.10 Thus, we can find out how many words in our language
(i) are unrepresented in a sample, or
(ii) never follow a sequence $s$ in the sample,
by subtracting
(a) the number of distinct words in the sample, or
(b) the number of distinct words that follow $s$ in the sample,
from the number of words in the language. We subtract (a) to get the number of words that meet condition (i). If instead we want the number that meet (ii), we subtract (b). Whichever of those two numbers we’re after, it makes sense to call it $N_0$, depending on how we define the subscript $r$ (for which $0$ is substituted). If we define $r$ as the number of times a word occurs in a sample, $N_0$ is the number of words of which (i) is true. If we define $r$ as the number of times a word follows $s$ in the sample, $N_0$ is the number of words of which (ii) is true. I favor the latter definition.
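Here is a minimal sketch, in Python, of the Good-Turing bookkeeping just described; the toy counts and the vocabulary size are invented, and real implementations smooth the $N_r$ series before applying the formula:

```python
from collections import Counter

# Invented data: how many times each word followed some fixed context s in a corpus.
follower_counts = Counter({"mat": 3, "sofa": 2, "roof": 2, "floor": 1, "lap": 1, "car": 1})
vocab_size = 20                      # assumed size of the whole vocabulary V
N = sum(follower_counts.values())    # number of occurrences of s in the corpus (here 10)

# Frequencies of frequencies: N_r = number of words that follow s exactly r times.
freq_of_freq = Counter(follower_counts.values())
freq_of_freq[0] = vocab_size - len(follower_counts)  # words that never follow s (here 14)

def good_turing_estimate(r):
    """Estimated probability that a word observed r times after s follows s next:
    p_r ~= r*/N, with r* = (r + 1) * N_{r+1} / N_r (expectations replaced by raw counts)."""
    r_star = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
    return r_star / N

# Estimated probability, per word, for each of the 14 words never yet seen after s:
print(good_turing_estimate(0))   # (1 * N_1 / N_0) / N = (3/14)/10 ≈ 0.021
print(good_turing_estimate(1))   # (2 * N_2 / N_1) / N = (2*2/3)/10 ≈ 0.133
```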
It must be said that the Good-Turing method is not, to my knowledge, used in the development of large language models. (Not today, at any rate). It’s one of the first approaches ever taken to predicting words in unfamiliar contexts—one that was successful and widely used in language modeling for a time, and one that has left an indelible mark on the AI literature. That is why I have taken an interest in it; at the moment I care more about how, in principle, it is feasible to predict words in unfamiliar contexts than about how this is accomplished in practice nowadays. And I believe anything we can learn about the humble origins of LLMs is bound to shed light on the way LLMs work today. We may even discover, eventually, that the rational basis for the Good-Turing method underlies the methodology of contemporary language modeling.
But what is that basis? To find out, we can read and reproduce Good’s derivation of Turing’s formula. I have no intention of laying out the full proof in this post. I only wish to establish one theorem the proof presupposes, which caught my attention.
One Theorem We’ll Need
As Good begins his proof of Turing’s formula, near the top of p. 240 of “The Population Frequencies of Species and the Estimation of Population Parameters,” he states, “We recall the theorem that an expectation of a sum is the sum of the expectations.” No, Good, I do not recall that theorem, nor do I see why it would hold. So let’s prove it, at least for an arbitrary pair of discrete random variables (r.v.’s) $X$ and $Y$; that is, let’s prove that

$$E(X+Y) = E(X) + E(Y).$$
Below I have made some minor revisions to a proof of $E(X+Y) = E(X) + E(Y)$, which I found on Mathematics Stack Exchange. Despite how small the edits are, coming up with them made a world of difference to my comprehension of the flow of reasoning. My version of the proof is broken down into steps numbered (1) through (9), the first of which is

$$E(X+Y) = E(+(X,Y)). \tag{1}$$
The expectation of $+(X,Y)$… What could that be? Expectations are only defined for functions, to my knowledge. But isn’t $X+Y$ just the sum of two variables (well, r.v.’s)? How is that sum a function? Technically, r.v.’s, including $X$ and $Y$, are functions. The question though is what makes the sum of two such functions a further function. We’ll answer this shortly.
The expectation of a function $g$ in two discrete variables $X$ and $Y$ is defined as

$$E(g(X,Y)) = \sum_{x}\sum_{y} g(x,y)\,p(x,y).$$

This licenses the inference from (1) to (2), given the fact that $+$ is a function in discrete variables $X$ and $Y$.
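As a sanity check on that definition, here is a small Python computation with a made-up joint PMF for $X$ and $Y$ (the numbers are arbitrary, chosen only so the probabilities sum to 1):

```python
# Made-up joint PMF p(x, y) over small ranges of X and Y.
pmf = {
    (1, 3): 0.1, (1, 4): 0.3,
    (2, 3): 0.4, (2, 4): 0.2,
}

# E(g(X, Y)) = sum over x and y of g(x, y) * p(x, y), here with g(x, y) = x + y.
expectation_of_sum = sum((x + y) * p for (x, y), p in pmf.items())

# Expectations of X and Y computed separately from the same joint PMF.
expectation_x = sum(x * p for (x, y), p in pmf.items())
expectation_y = sum(y * p for (x, y), p in pmf.items())

print(expectation_of_sum)             # 5.1 (up to floating-point rounding)
print(expectation_x + expectation_y)  # the same value
```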
The notation $+(X,Y)$ may puzzle you. This is just my unconventional translation of $X+Y$ into function notation. It may be an abuse of notation, but it makes explicit the fact that the (binary) addition operator is a function, as is any operator. It’s specifically a function that takes two inputs (no more, no fewer), hence the term “binary.” The inputs in question are the addends, typically written to the left and right of the $+$ sign. But I’ve decided to move these inputs, $X$ and $Y$, to the positions typically occupied, in function notation, by the independent variables of a binary function: between parentheses to the right of the function symbol. Normally a function is expressed as something like $f(x,y)$, with a letter (in this case, $f$) denoting the function itself. $+$ is obviously not a letter. Nonetheless, you can understand $+(X,Y)$ as an expression of the form $f(X,Y)$ (with $+$ being our $f$) that specifically stands for the addition function.
This is no mere notational or verbal trickery. Kripkensteinian doubts notwithstanding, $+$ truly does denote a function, as it receives inputs, and every time it takes in the same set of numbers, it spits out the same number. $2+2$ equals $4$ and nothing else, $68+57$ equals $125$ and nothing else, and so on ad infinitum. Indeed, we can even write an equation which all but explicitly states that $+$ is a function with inputs $x$ and $y$:

$$+(x,y) = x + y.$$

One might interpret the equation as saying the output of $+$ for the particular inputs $x$ and $y$ is the sum of $x$ and $y$. But there’s nothing stopping us from using $x$ and $y$ in this equation to represent any two values $X$ and $Y$ can take, rather than a particular pair of numbers. Better yet, we can replace $x$ and $y$ with $X$ and $Y$ to refer to the r.v.’s themselves rather than any specific values they may take. When we do that, we get the expression with which we began, and we see it does indeed signify a binary function. If the equation $+(X,Y) = X + Y$ is to capture the function, something on the right side has to be equivalent to $+(\,\cdot\,,\,\cdot\,)$ on the left side, and it’s clearly not $X$ and/or $Y$ alone. One possible counterpart remains: the $+$ sign itself, which, like $+(\,\cdot\,,\,\cdot\,)$, unites $X$ and $Y$, determining how they’re combined to yield a new value.
$p(x,y)$ may be a bit confusing as well. According to Wolfram MathWorld’s definition of expected value, $p$ is a probability mass function (PMF).11 Presumably it outputs a probability, but the probability of what? The probability of $x$ and $y$? What would that even mean? Perhaps the proof on Mathematics Stack Exchange can shed some light on the subject. There, in place of $p(x,y)$ we find the function notation $P(X=x, Y=y)$. It would appear, then, that $p(x,y)$ and $P(X=x, Y=y)$ are interchangeable in the formula for expectation ($E(g(X,Y)) = \sum_x\sum_y g(x,y)\,p(x,y)$). If that’s right, the probability that $p(x,y)$ represents must be the probability that $X=x, Y=y$, where $X=x$ means $X$ takes value $x$ and $Y=y$ means $Y$ takes value $y$. But what does the comma between $X=x$ and $Y=y$ represent? One can’t help but wonder as well how the two functions can be interchangeable despite the evident differences between the types of input they take: a list of comma-separated numbers in the case of $p$ and a list of comma-separated equations in the case of $P$.
There are two possibilities. One is that, despite appearances, the functions are the same. The types of input they take do not differ. The other possibility is that the functions are different, but $p$’s output for any given pair of inputs $x$ and $y$ equals $P$’s output for the input(s) $X=x, Y=y$. The first possibility may sound absurd now. But it sounds less absurd once you realize that the input $X=x, Y=y$ is a set, that each input to $p$ can be understood as an ordered pair, and that ordered pairs are definable as sets.
Before I explain how $X=x, Y=y$ denotes a set in $P(X=x, Y=y)$, I should say a few words about what the simpler input expression $X=x$ represents. It’s shorthand for $\{X=x\}$, which is in turn shorthand for $\{o \in \Omega \mid X(o) = x\}$. What we have here is a set: a funny-looking set, but a set nonetheless. This is what’s known as “set-builder notation.” We read $\{o \in \Omega \mid X(o) = x\}$ as “the set of all $o$ in $\Omega$ such that $X$ of $o$ equals $x$.” More generally, the notation signifies the set of all things in the set indicated to the left of the vertical bar that satisfy the condition to the right of the bar. In the case at hand, our set is the sample space $\Omega$: the set of possible outcomes of experiments involving random variable $X$. Further, $X(o) = x$ is our condition. So $X=x$ is the set containing every possible outcome whose being input to $X$ yields the output $x$; in measure theory this set is called the event that $X$ takes value $x$.12 Nota bene: Depending on the context, $X$ denotes either a general variable or a function with domain $\Omega$ and range $R_X$. General variable $X$ takes (i.e., equals) value $x$ if and only if function $X$ outputs $x$.
$P(X=x, Y=y)$ can be read as “the probability that $X=x$ and $Y=y$.” And that comma is to be understood the same way as the comma in a sentence like “Let $x=1$, $y=2$”: as a symbol for conjunction, equivalent in meaning to the word “and.” So, the input expression $X=x, Y=y$ is short for $X=x$ and $Y=y$, which is short for $\{X=x\} \cap \{Y=y\}$, where the intersection symbol $\cap$ means “and.”
So we’ve established that $P$ takes a set as input. It remains to be seen whether the domains of $P$ and $p$ include all and only the same sets, and thus are one and the same domain (as they must be if $P$ and $p$ are to be the same function). $p$ denotes a function in two variables, specifically a joint probability mass function. One may think of any function in two variables $x$ and $y$ as having either

(i) a single domain whose elements are ordered pairs $(x,y)$ drawn from $R_X \times R_Y$, where $R_X$ and $R_Y$ are the ranges of values which $X$ and $Y$, respectively, can take, or

(ii) two domains, $R_X$ and $R_Y$ (defined as above).
For the time being we may set aside case (ii), as that directly contradicts the thesis we’re entertaining: that $p$ has the same domain $P$ has (and no others), of which there is only one.
If we take $p$ to have a domain of type (i), we should interpret $p(x,y)$ as shorthand for $p((x,y))$. Moreover, we in that case take the domain of $p$ to be some subset, proper or improper, of the Cartesian product $R_X \times R_Y$, where $R_X$ and $R_Y$ are the ranges of our discrete r.v.’s $X$ and $Y$. The domain could for instance be the set of all and only ordered pairs $(x,y)$ in $R_X \times R_Y$ such that the probability is nonzero that $X=x$ and $Y=y$. But let’s make things easier for ourselves by treating the entire Cartesian product as our domain. Let’s also define $\Omega$ as our sample space, the set of possible outcomes of experiments with variables $X$ and $Y$. If $P$ is to be a probability measure, its domain must be a $\sigma$-algebra on $\Omega$, which is a special kind of set of subsets of $\Omega$. Seeing as the subsets of $\Omega$ are events, the domain of $P$ must consequently be a set of events, or event space, $\mathcal{F}$. So as long as our event space $\mathcal{F}$ can be $R_X \times R_Y$, the functions $p$ and $P$ can have the same domain.
Recall that $R_X \times R_Y$ includes all and only ordered pairs $(x,y)$ s.t. $x \in R_X$ and $y \in R_Y$. It was also mentioned that ordered pairs are definable as sets. Let’s take a look at one set-theoretic definition of an ordered pair (Kuratowski’s):

$$(x,y) = \{\{x\}, \{x,y\}\}.$$
By this definition, if $(x,y)$ is to be an event in $\mathcal{F}$, then $\{x\}$ and $\{x,y\}$ must be possible outcomes of experiments involving variables $X$ and $Y$. I see no reason that such experiments couldn’t have those outcomes. Therefore, I tentatively conclude that $p$ and $P$ can have the same domain and indeed be the same function. But should we choose (as we certainly may) to view $p$ as having two domains of numbers rather than one of ordered pairs, the same cannot be true of $P$, unless, contrary to our supposition, $P$ is not a probability measure. A probability measure has only one domain: a set of subsets of a sample space. So if we go this route, we must distinguish between the functions $p$ and $P$. But that’s not to deny that $p(x,y) = P(X=x, Y=y)$. We still have that $p$ is a function wherein $p(x,y) = P(\{X=x\} \cap \{Y=y\})$. That’s how the joint probability mass function $p$ is related to the probability measure $P$. For more on this relationship, see the appendix.
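To make the distinction tangible, here is a small Python sketch (a hypothetical two-coin experiment of my own devising) in which the measure $P$ acts on sets of outcomes while the PMF $p$ acts on pairs of numbers, and the two agree in the way just described:

```python
from fractions import Fraction

# Sample space: outcomes of tossing two fair coins, each outcome equally likely.
omega = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]

def X(outcome):  # number of heads shown by the first coin (0 or 1)
    return 1 if outcome[0] == "H" else 0

def Y(outcome):  # number of heads shown by the second coin (0 or 1)
    return 1 if outcome[1] == "H" else 0

def P(event):
    """Probability measure: takes a *set of outcomes* and returns its probability."""
    return Fraction(len(event), len(omega))

def p(x, y):
    """Joint PMF: takes a *pair of numbers* and returns P({X = x} ∩ {Y = y})."""
    event = {o for o in omega if X(o) == x and Y(o) == y}
    return P(event)

print(P({("H", "H"), ("H", "T")}))  # 1/2: the event that X = 1
print(p(1, 0))                      # 1/4: equals P({X = 1} ∩ {Y = 0})
```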
With that out of the way, let’s get back to our derivation of the theorem that $E(X+Y) = E(X) + E(Y)$. I consider the next four steps in the proof pretty self-explanatory:

$$= \sum_{x}\sum_{y} +(x,y)\,P(X=x, Y=y) \tag{2}$$

$$= \sum_{x}\sum_{y} (x+y)\,P(X=x, Y=y) \tag{3}$$

$$= \sum_{x}\sum_{y} \big[x\,P(X=x, Y=y) + y\,P(X=x, Y=y)\big] \tag{4}$$

$$= \sum_{x}\sum_{y} x\,P(X=x, Y=y) + \sum_{x}\sum_{y} y\,P(X=x, Y=y) \tag{5}$$
I’ll explain briefly. We can move from (2) to (3) thanks to the fact that $+(x,y)$ is shorthand for $x+y$. (4) is allowed because multiplication is distributive. In (5) we break a single sum of sums (henceforth “nested sum”) of $x\,P(X=x, Y=y) + y\,P(X=x, Y=y)$ over ranges $R_X$ and $R_Y$ into two such sums, one for each of the addends $x\,P(X=x, Y=y)$ and $y\,P(X=x, Y=y)$. This is just an application of the rule that $\sum_{x}\sum_{y} (A + B) = \sum_{x}\sum_{y} A + \sum_{x}\sum_{y} B$, for arbitrary expressions $A$ and $B$ equal to real numbers. The rule holds whether or not expression $A$ or $B$ contains $x$ or $y$.
Just to be sure this rule’s intuitive appeal does not mislead, let’s work through a few examples of the rule in action. We’ll stipulate beforehand that, in each of these cases, the ranges of r.v.’s $X$ and $Y$ are $\{1, 2\}$ and $\{3, 4\}$, respectively.
Example 1: Let $A = x$ and $B = y$. Then we have:

$$\sum_{x}\sum_{y} (x+y) = 4+5+5+6 = 20 = 6 + 14 = \sum_{x}\sum_{y} x + \sum_{x}\sum_{y} y.$$

Example 2: Let $A = 2x$ and $B = 3$ (a constant containing neither $x$ nor $y$). That gives us:

$$\sum_{x}\sum_{y} (2x+3) = 5+5+7+7 = 24 = 12 + 12 = \sum_{x}\sum_{y} 2x + \sum_{x}\sum_{y} 3.$$

Example 3: Let $A = xy$ and $B = x^2$. From this we obtain:

$$\sum_{x}\sum_{y} (xy+x^2) = 4+5+10+12 = 31 = 21 + 10 = \sum_{x}\sum_{y} xy + \sum_{x}\sum_{y} x^2.$$
For the time being I’ve satisfied myself that our rule has no counterexamples. Let’s move on.
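For extra assurance, here is a quick brute-force check of the rule in Python, using arbitrary, made-up addend expressions over small ranges:

```python
R_X = [1, 2, 3]
R_Y = [4, 5]

def A(x, y):  # an arbitrary expression in x and y
    return x * y + 7

def B(x, y):  # an arbitrary expression that happens to ignore y
    return x ** 2

lhs = sum(A(x, y) + B(x, y) for x in R_X for y in R_Y)
rhs = sum(A(x, y) for x in R_X for y in R_Y) + sum(B(x, y) for x in R_X for y in R_Y)
print(lhs == rhs)  # True
```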
Getting from (5) to (6) is a bit trickier, in my opinion. Recall that (5) reads

$$\sum_{x}\sum_{y} x\,P(X=x, Y=y) + \sum_{x}\sum_{y} y\,P(X=x, Y=y).$$

Now, here’s (6):

$$\sum_{x}\sum_{y} x\,P(X=x, Y=y) + \sum_{y}\sum_{x} y\,P(X=x, Y=y)$$

and, while we’re at it, (7):

$$\sum_{x} x\sum_{y} P(X=x, Y=y) + \sum_{y} y\sum_{x} P(X=x, Y=y).$$

Once we’ve derived (6), (7) will follow as a matter of course. But where does (6) come from, exactly? The only change from (5) is that, in the second nested sum, the order of the two summation signs has been swapped.
So to get from (5) to (6), all we need is the principle that $\sum_{x}\sum_{y} f(x,y) = \sum_{y}\sum_{x} f(x,y)$ for an arbitrary function $f$ (two such functions being $x\,P(X=x, Y=y)$ and $y\,P(X=x, Y=y)$, regarded as functions of $x$ and $y$). But to me this is even less intuitive than the rule we’ve already proposed and tested against a few examples (not that the two rules are in competition with each other). So this one stands in greater need of justification.
Let’s try a reductio ad absurdum, in hopes of achieving full generality. That is, let’s assume $\sum_{x}\sum_{y} f(x,y) \neq \sum_{y}\sum_{x} f(x,y)$, the opposite (negation) of what we want to prove. We’ll proceed to show that this assumption entails a contradiction. Then we’ll have demonstrated in a roundabout way that, contrary to our supposition, $\sum_{x}\sum_{y} f(x,y) = \sum_{y}\sum_{x} f(x,y)$. Otherwise, a contradiction would be true, which is absurd/impossible!
Consequently, writing $R_X = \{x_1, \ldots, x_m\}$ and $R_Y = \{y_1, \ldots, y_n\}$,

$$\big[f(x_1,y_1) + \cdots + f(x_1,y_n)\big] + \big[f(x_2,y_1) + \cdots + f(x_2,y_n)\big] + \cdots + \big[f(x_m,y_1) + \cdots + f(x_m,y_n)\big]$$

does not equal

$$\big[f(x_1,y_1) + \cdots + f(x_m,y_1)\big] + \big[f(x_1,y_2) + \cdots + f(x_m,y_2)\big] + \cdots + \big[f(x_1,y_n) + \cdots + f(x_m,y_n)\big].$$

At this juncture we need to think carefully about these two, supposedly unequal nested sums, particularly about each sequence of addends/terms enclosed in a pair of square brackets. Let $S_1$ denote the first nested sum and $S_2$ the second. To be clear, we’re defining $S_1$ and $S_2$ such that

$$S_1 = \big[f(x_1,y_1) + \cdots + f(x_1,y_n)\big] + \cdots + \big[f(x_m,y_1) + \cdots + f(x_m,y_n)\big]$$

and

$$S_2 = \big[f(x_1,y_1) + \cdots + f(x_m,y_1)\big] + \cdots + \big[f(x_1,y_n) + \cdots + f(x_m,y_n)\big].$$

We know that, if we take the first term of each sequence of terms enclosed in square brackets in $S_1$, we can use those terms to form the sequence

$$f(x_1,y_1),\ f(x_2,y_1),\ \ldots,\ f(x_m,y_1).$$

Whaddayaknow?!? That’s the first sequence of terms in $S_2$! That is, it’s what you get when you remove the brackets from

$$\big[f(x_1,y_1) + \cdots + f(x_m,y_1)\big],$$

substitute commas for the $+$ signs, and make explicit a few more of the terms.

So at least $S_1$ and $S_2$ have those terms in common. Can we find in $S_1$ the second sequence from $S_2$?

As a matter of fact, we can! The sequence in question is

$$f(x_1,y_2),\ f(x_2,y_2),\ \ldots,\ f(x_m,y_2).$$

We can form this same sequence by taking the second term of each sequence in $S_1$. First we took the first term of each, and now we’re taking the second. If we iterate this process $n-2$ more times, we’ll reproduce every sequence of terms in $S_2$, exclusively using terms of $S_1$. We’ll then have established (in fact we already have, using our imagination) that every term of $S_2$ is a term of $S_1$. But what about vice versa? Is every term of $S_1$ a term of $S_2$? If so, $S_1$ and $S_2$ are sums of exactly the same terms. That would mean $S_1 = S_2$, contrary to the conclusion we derived from our assumption for reductio.

It turns out that there is a parallel procedure we can follow to construct each sequence of terms in $S_1$, exclusively using terms of $S_2$. The first sequence in $S_1$ is just the sequence of the first terms of all the sequences in $S_2$. The second sequence in $S_1$ is the sequence of the second terms of all the sequences in $S_2$. And so on, all the way up to the $m$th and final sequence in $S_1$.

In conclusion, $S_1$ and $S_2$ have exactly the same terms; neither has a term the other lacks. And since addition is commutative and associative, a finite sum does not depend on the order in which its terms are added. So $S_1 = S_2$. But we had already concluded that $S_1 \neq S_2$. Of course, that directly contradicts that $S_1 = S_2$. So it must be that in actuality $\sum_{x}\sum_{y} f(x,y) = \sum_{y}\sum_{x} f(x,y)$, contrary to our initial assumption for the sake of argument. Quod erat demonstrandum! This concludes our second subproof.
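A tiny numeric check of the order-swapping principle, with deliberately unequal range sizes ($m \neq n$) and an asymmetric, made-up function:

```python
R_X = [1, 2, 3, 4]   # m = 4
R_Y = [10, 20]       # n = 2

def f(x, y):  # asymmetric in x and y, so the agreement isn't an accident of symmetry
    return x ** 2 * y + x

x_then_y = sum(sum(f(x, y) for y in R_Y) for x in R_X)
y_then_x = sum(sum(f(x, y) for x in R_X) for y in R_Y)
print(x_then_y, y_then_x, x_then_y == y_then_x)  # 920 920 True
```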
We can now be certain of the truth of (6):

$$= \sum_{x}\sum_{y} x\,P(X=x, Y=y) + \sum_{y}\sum_{x} y\,P(X=x, Y=y). \tag{6}$$

And clearly (7), which reads

$$= \sum_{x} x\sum_{y} P(X=x, Y=y) + \sum_{y} y\sum_{x} P(X=x, Y=y), \tag{7}$$

follows from (6) by the definition of the nested sums and the distributivity of multiplication: the factor that the inner sum does not vary over ($x$ in the first nested sum, $y$ in the second) can be pulled out of that inner sum. The remainder of the proof is written below.

$$= \sum_{x} x\,P(X=x) + \sum_{y} y\,P(Y=y) \tag{8}$$

$$= E(X) + E(Y) \tag{9}$$
Equation (9) is our theorem, so we could stop here. But I feel I should elaborate on how we got from (7) to (8), in case it’s unclear. Consider the first term on the right side of the equation in (7). For any fixed $x$, its inner sum can be rewritten as follows:

$$\sum_{y} P(X=x, Y=y) = \sum_{y} P(Y=y \mid X=x)\,P(X=x) = P(X=x)\sum_{y} P(Y=y \mid X=x) = P(X=x) \cdot 1 = P(X=x).$$
Note that we use conditional probabilities to calculate the joint probabilities of pairs of values that $X$ and $Y$ might take, rather than simply multiplying the unconditional probabilities of the possible values. This way we leave room for the possibility that $X$ and $Y$ are dependent.
By the same reasoning we can show that

$$\sum_{x} P(X=x, Y=y) = \sum_{x} P(X=x \mid Y=y)\,P(Y=y) = P(Y=y)\sum_{x} P(X=x \mid Y=y) = P(Y=y) \cdot 1 = P(Y=y).$$
You might wonder why $\sum_{y} P(Y=y \mid X=x)$ and $\sum_{x} P(X=x \mid Y=y)$ each equal $1$. Answer: For the same reason that $\sum_{x} P(X=x)$ and $\sum_{y} P(Y=y)$ each equal $1$. The numbers $x_1, \ldots, x_m$ represent all the values $X$ can take, and the numbers $y_1, \ldots, y_n$ represent all the values $Y$ can take. In other words, $X$ has to equal one of $x_1, \ldots, x_m$, and $Y$ has to equal one of $y_1, \ldots, y_n$. It stands to reason then that with probability $1$, be it conditional or unconditional, either of the two r.v.’s takes some value or other in the corresponding range. But that can only be the case if the probabilities of the individual values in the range add up to $1$, which is of course equal to $100\%$.
So we have obtained

$$\sum_{y} P(X=x, Y=y) = P(X=x)$$

as well as

$$\sum_{x} P(X=x, Y=y) = P(Y=y).$$

Synthesizing these two results, we get

$$E(X+Y) = \sum_{x} x\,P(X=x) + \sum_{y} y\,P(Y=y) = E(X) + E(Y).$$
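A short Python check that the marginalization step behaves as claimed, reusing a made-up joint PMF (exact fractions avoid floating-point noise):

```python
from fractions import Fraction as F

pmf = {
    (1, 3): F(1, 10), (1, 4): F(3, 10),
    (2, 3): F(4, 10), (2, 4): F(2, 10),
}
xs = sorted({x for x, _ in pmf})
ys = sorted({y for _, y in pmf})

# Marginals obtained by summing the joint PMF over the other variable.
P_X = {x: sum(pmf[(x, y)] for y in ys) for x in xs}
P_Y = {y: sum(pmf[(x, y)] for x in xs) for y in ys}

E_X = sum(x * P_X[x] for x in xs)
E_Y = sum(y * P_Y[y] for y in ys)
E_sum = sum((x + y) * p for (x, y), p in pmf.items())

print(P_X)                 # {1: Fraction(2, 5), 2: Fraction(3, 5)}
print(E_sum == E_X + E_Y)  # True (both equal 51/10)
```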
Appendix
Let $F$ be the joint cumulative distribution function (CDF) of discrete random variables $X$ and $Y$. Just as the probability density function (PDF) corresponding to a continuous CDF is the derivative of that CDF, the PMF corresponding to a discrete CDF, such as $F$, is the jump of that CDF at the point in question. Below, this jump at $(x,y)$ is denoted by $J_F(x,y)$. A function must be continuous to be differentiable, hence the need to identify probability masses with something other than derivatives in the discrete case.
What’s the jump of a discrete joint CDF at a point? It helps to know what a joint CDF is first. I won’t give a definition (i.e., a set of individually necessary and jointly sufficient conditions), but I will explain what you need to know about the concept, which luckily enough I know. A joint CDF is a function in multiple variables $x_1, \ldots, x_n$ such that, if each $x_i$ is a value a distinct random variable $X_i$ can take, the function outputs the probability of the conjunctive event that $X_1$ takes a value $\leq x_1$, $X_2$ a value $\leq x_2$, $\ldots$, and $X_n$ a value $\leq x_n$. That is,

$$F(x_1, \ldots, x_n) = P(X_1 \leq x_1, \ldots, X_n \leq x_n).$$

So the joint CDF $F$ is not the joint PMF $p$, but the two are intimately related. Intuitively, the probability that an r.v. takes a value no greater than $x$ is no less than the probability that it takes value $x$.
The same goes for the probability that each of multiple r.v.’s takes a value no greater than the corresponding $x_i$; that probability is no less than the probability that each takes value $x_i$ exactly. But the former probability could easily be greater than the latter. To be exact, $P(X_1 = x_1, \ldots, X_n = x_n)$ is equal to

$$P(X_1 \leq x_1, \ldots, X_n \leq x_n) - P(X_1 \leq x_1, \ldots, X_n \leq x_n \text{ and } X_i < x_i \text{ for at least one } i).$$

In English: we can obtain the probability that, collectively, the variables each take value $x_i$ by subtracting the probability that some $X_i$ or other takes a value less than $x_i$ (while none exceeds it) from the probability that each takes a value less than or equal to $x_i$. This is because, given that each takes a value less than or equal to $x_i$, there are two mutually exhaustive and exclusive possibilities:
- Each $X_i$ is equal to $x_i$.
- Some $X_i$ or other is less than $x_i$.
So the probabilities of these possibilities add up to the probability that each $X_i$ takes a value less than or equal to $x_i$. Moreover, the likelihood of either possibility is the likelihood of the other’s negation (again, assuming that each $X_i$ takes a value less than or equal to $x_i$). But if we dispense with that assumption, then we must additionally consider how likely it is, in the first place, that each $X_i$ takes a value less than or equal to $x_i$.
To reiterate: we calculate $P(X_1 = x_1, \ldots, X_n = x_n)$ by subtracting $P(X_1 \leq x_1, \ldots, X_n \leq x_n \text{ and } X_i < x_i \text{ for at least one } i)$ from $P(X_1 \leq x_1, \ldots, X_n \leq x_n)$, with this latter expression being equivalent to $F(x_1, \ldots, x_n)$. We’re now well on our way to seeing what the jump of a discrete joint CDF at a point is and why it matters. Observe that the subtracted probability can itself be computed from values of $F$ at points lying just below $(x_1, \ldots, x_n)$ in one or more coordinates. Thus, the difference between $F(x_1, \ldots, x_n)$ and that quantity is exactly the probability mass $p(x_1, \ldots, x_n)$.
This is where the jump of $F$ at $(x_1, \ldots, x_n)$ makes its entrance. We may define this jump, $J_F(x_1, \ldots, x_n)$, as the difference $F(x_1, \ldots, x_n) - P(X_1 \leq x_1, \ldots, X_n \leq x_n \text{ and } X_i < x_i \text{ for at least one } i)$, which, believe it or not, is the same difference (pun intended) presented immediately prior. This makes a bit of sense, as we already know that the PMF is the jump of the CDF, by definition. It remains to be elucidated how the jump is computed in the one-variable case, how it is computed in the two-variable case, and how both computations amount to taking the difference just described.
First, some definitions: let $R_X = \{x_1, x_2, x_3, \ldots\}$ and $R_Y = \{y_1, y_2, y_3, \ldots\}$ be the ranges of $X$ and $Y$, with their elements indexed in increasing order, and let $F_X$ be the (one-variable) CDF of $X$, i.e., $F_X(x) = P(X \leq x)$.
Now then, I’ll try to give you a rough idea of how these sundry puzzle pieces interlock. It’s easier to understand how the jump of a discrete CDF in one variable works than in two variables. We’ll start there. Take the CDF $F_X$ of a discrete random variable $X$, where the range of $X$ is $R_X$. By the definition of “discrete random variable,” $R_X$ has countably many elements. So for each value aside from the maximum in $R_X$ (if there is one), there is a “next” value: the least value in $R_X$ greater than it. If the set of values $X$ can take is $\{1, 2, 3\}$, then relative to $1$, the next value is $2$; relative to $2$, it’s $3$.13
Recall that $F_X(x)$ is the probability that $X \leq x$. As you shift your attention from one value in $R_X$ to the next, the probability that $X$ is less than or equal to the value under consideration either increases or remains the same. This is a straightforward consequence of the fact that more values don’t exceed the current one than didn’t exceed the last. So there’s now a greater number of ways $X$ can be less than or equal to the current value. The probabilities of those various ways add up, and each probability is nonnegative. So the overall probability never decreases. On the contrary, it tends to increase.14 And $F_X$ increases along with it. That increase in $F_X$ accompanying the transition to the next value in $X$’s range (provided that value is input to $F_X$): that is the jump of $F_X$ at that value, which is also the probability mass at it.15 If you want to get technical about it, the jump at $x_k$ is $F_X(x_k) - \lim_{t \to x_k^-} F_X(t)$, i.e., $F_X(x_k) - F_X(x_{k-1})$. So we have the formula

$$p_X(x_k) = F_X(x_k) - F_X(x_{k-1}),$$

where $F_X(x_0)$ is stipulated to be $0$.
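A minimal Python illustration of recovering a PMF from CDF jumps, using an invented distribution on $\{1, 2, 3\}$:

```python
# Invented PMF of a discrete r.v. X with range {1, 2, 3}.
pmf = {1: 0.2, 2: 0.5, 3: 0.3}
support = sorted(pmf)

def cdf(x):
    """F_X(x) = P(X <= x)."""
    return sum(p for value, p in pmf.items() if value <= x)

# Recover the PMF as the jump of the CDF at each point of the support.
previous = None
for x in support:
    jump = cdf(x) - (cdf(previous) if previous is not None else 0.0)
    print(x, round(jump, 10))   # prints 1 0.2, then 2 0.5, then 3 0.3
    previous = x
```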
By more or less the same reasoning, the jump of the discrete joint CDF $F$ at $(x_j, y_k)$ is

$$J_F(x_j, y_k) = F(x_j, y_k) - F(x_{j-1}, y_k) - F(x_j, y_{k-1}) + F(x_{j-1}, y_{k-1}),$$

where $F(x_0, \cdot)$ and $F(\cdot, y_0)$ are stipulated to be $0$.
I leave it to you to make explicit how this follows, should you find the motivation to do so.
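Should you prefer to check it numerically rather than derive it, here is a small Python verification with an invented joint PMF:

```python
# Invented joint PMF of discrete r.v.'s X and Y.
pmf = {(1, 1): 0.1, (1, 2): 0.2, (2, 1): 0.3, (2, 2): 0.4}
xs, ys = [1, 2], [1, 2]

def F(x, y):
    """Joint CDF: P(X <= x, Y <= y)."""
    return sum(p for (a, b), p in pmf.items() if a <= x and b <= y)

def jump(j, k):
    """F(x_j, y_k) - F(x_{j-1}, y_k) - F(x_j, y_{k-1}) + F(x_{j-1}, y_{k-1}),
    with F taken to be 0 whenever an index is 0."""
    def f(a, b):
        return F(xs[a - 1], ys[b - 1]) if a >= 1 and b >= 1 else 0.0
    return f(j, k) - f(j - 1, k) - f(j, k - 1) + f(j - 1, k - 1)

for j, x in enumerate(xs, start=1):
    for k, y in enumerate(ys, start=1):
        print((x, y), round(jump(j, k), 10), pmf[(x, y)])  # the two numbers match
```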
Endnotes
1. To be clear, I’m not suggesting
that, without exception, GPT predicts that a word will follow the
previous one. Were it to behave that way, its response to a prompt would
never end. For it outputs every word it predicts, and once it outputs a
word, it predicts what would follow that word. (It doesn’t just make one
prediction as to what would follow a prompt from the user, but also a
distinct prediction of what would follow the sequence for every addition
it makes to the sequence.) Its responses are not endless because, sooner
or later, it predicts that the hypothetical interlocutor it imitates
would stop talking and thus that no word would come next—or, if there
would be an additional word, the user would be the one to say it, so GPT
has no need to predict it.
2. I leave open whether each of a
language’s idioms counts as a “word” in its vocabulary.
3. At
least, the training data represent these probabilities as accurately as
the words’ actual frequencies do.
4. Note that a sequence, and even a sentence in particular, can contain repetitions of a word. This is why we simply raise $10^5$ to the tenth power; there are $10^5$ options for each of the ten words in the sequence, regardless of how many of those ten words we’ve already selected.
5. Granted, many of the sequences we’re counting are not well-formed sentences, paragraphs, or subsentential expressions, since we are considering every possible arrangement of any ten words, without regard for the constraints of grammar or individual words’ rules of use. But for every ungrammatical or nonsensical ten-word sequence, there must be a perfectly acceptable eleven-, twelve-, thirteen-, or longer word sequence that we could add to our count.
6. There are a grand total of $26^4 = 456{,}976$ four-letter sequences, but only $26^3 = 17{,}576$ three-letter sequences.
7. We may ignore broken subsequences of the training text, because one would hope an LM would base its “understanding” of a language exclusively on continuous pieces of text in its training sample, so as to keep track of the correct ordering of words (and characters, for that matter). A broken subsequence of a sequence $s$ is a subsequence of $s$ with consecutive terms that are not consecutive terms of $s$ (at least, not consecutive terms lying between the positions in $s$ corresponding to the endpoints of the subsequence).
8. Note that in calling $r$ a “frequency,” we’re saying it’s a number of occurrences rather than a ratio of such numbers. It’s a question of how many whosits, not how many whosits out of so many whatsits. In other words, the frequency is not the number of whosits divided by the number of whatsits, though, here, the number of whosits we have in mind just so happens to be the number of whosits associated with certain whatsits, rather than the total number of whosits.
9. Alternatively, if we let $r$ be the number of times a particular sequence of $n$ words occurs, we can find the number $N_r$ of sequences of $n$ words that each occur $r$ times.
10. However, it is not commonly
known how many of the words in the English vocabulary were used during a
given period in the history of the English language.
11. The
linked page calls it a “probability density function,” but technically
we should use “mass” rather than “density” when talking about discrete
random variables, ones whose ranges’ elements have gaps between them
(unlike, for instance, the elements of the continuum/set of real
numbers).
12. You might think it would make more sense to classify the individual outcomes as events than to classify sets of them as events. But in measure theory probabilities are assigned to sets of outcomes according to how likely it is that the disjunction of those outcomes obtains. Take a standard die. Suppose the numbers you can roll are the possible outputs of our random variable $X$. So $X$ can take any integer value between $1$ and $6$, inclusive. This gives us the range $R_X$ of $X$, but what about $X$’s domain $\Omega$, i.e., the sample space? To each number $i$ in $R_X$ there corresponds a side of the die with $i$ dots on it. Let $o_i$ be the outcome that the side with $i$ dots on it is facing up, after the die is rolled. Then we can take the domain $\Omega$ of $X$ to be $\{o_1, o_2, o_3, o_4, o_5, o_6\}$. The probability of $\{o_2, o_4, o_6\}$ is then that of the event that $o_2$, $o_4$, or $o_6$ is the outcome; in other words, the event that an even number in $R_X$ is rolled. When we conceptualize sets of outcomes this way, as themselves being disjunctive outcomes (though not elements of $\Omega$), they do start to look like events in the familiar sense, like things that can happen.
13. That’s not due to the
order in which the numbers are written in the set notation; regardless
of the order they’re written in, it helps to imagine they’re arranged in
increasing order.
14. Some sources even suggest that, by definition, the overall probability always increases. They say the range of a discrete r.v. $X$ is defined as the set of values $x$ such that $P(X = x) > 0$.
15. You may be wondering how the first value $x_1$ in $X$’s range has a probability mass in that case, or how there could be a jump at $x_1$. The answer is that it’s the difference between $F_X(x_1)$ and $F_X(x_0)$, where the latter of course equals zero.