Compression of Integers.

Being a student of computer science myself, whenever I have to optimize code, generally we have to compromise on space complexity. For example, consider the problem where you have to return the indices () of two elements from an integer array nums such that nums[i] + nums[j] = target.

There are two ways to solve this. The brute-force approach dictates running two nested for loops to find the indices. Another approach is to store the elements and their respective indices into a map, and retrieve the indices from there.

In the brute-force method, the time complexity is in the worst case, and the space complexity is (ignoring the space required to store the answer). In the optimized case, we achieve a time complexity of , but the space complexity increases to in the worst case. Here, represents the number of elements in the array.

This simple tradeoff led me to wonder: could we compress the integers themselves? If we could somehow encode that entire hash map—or the array itself—into a single integer that only takes space under the Uniform Cost Model, we could theoretically break the classic time-space tradeoff. That is, suppose we are given a sequence of integers , and we want to find a single integer that can carry all the information from that entire sequence.

Note that in computer science, as an integer grows, it requires more bits to store it in memory. So, even if we compress these integers into one, the resulting number can be so large that it ultimately requires the same number of bits. Memory-wise, this might not be efficient in reality. However, here we are speaking strictly in mathematical terms, assuming that storing a single integer —no matter if —still requires exactly one unit of storage. Generally, theoretical models that require unit storage even for arbitrarily large integers are known as the Uniform Cost Model.

Assumptions and Terminology

Before we actually go into the methods to encode the numbers, for consistency and to ensure we are all on the same page, it is important to lay out the formal definitions. Don’t worry—it will not be that difficult to understand.

The Domain: Let be the set of all finite sequences (including the empty sequence) of non-negative integers. So, an element looks like where .

The Codomain: Let the target space be the set of non-negative integers .

The Encoding Function: We are looking to define a function .

Requirement: The function must be injective. That is, for every unique , there exists an with a one-to-one mapping such that . Mathematically, if we have two sequences and , then

Also note that here, we refer to as a sequence rather than a set. In pure math, standard sets cannot contain duplicate elements and have no strict order. But in our case—much like an array in computer science—we absolutely care about duplicate elements, and the exact order of those elements matters. If two sequences contain the same numbers but in a different order or with different frequencies, they are entirely different sequences.

Even though I just stated the mathematical definitions, I want to give you a rough idea of these terms and symbols so that you can carry them forward without much effort. Here, represents any sequence formed by arbitrarily choosing non-negative integers. In computer science terms, it is basically an array. Now let’s discuss a little bit. One thing to remember about is that for every unique input , there must be one and only one unique output . Why? Because suppose two different sequences and have the same output —then how do we identify which input was given to the function? To avoid this ambiguity, the function must be injective.

Don’t worry about the codomain thing. It is just there to define the function clearly (and to scare you), but just remember the output must be a non-negative integer. Treat it like a return type in functions. That’s it.

Now to the question of surjection. That is, for every in the codomain, there must be a sequence . In simpler terms, you gave me the number , and then does there exist an array which on compression results in ? Now this is interesting. Because we are not binding our to be surjective. But it will be interesting to note that depending upon techniques, we will look at whether is surjective or not. And once it becomes surjective, it will be a bijection.

Concatenation

Many mathematical techniques exist to achieve this encoding. Before going into highly abstract methods, a common and intuitive method we can instantly think of is the concatenation of all integers in the sequence.

Before defining this method mathematically, let’s rigorously define what a "digit length" is using the floor function . Consider an integer . The digit length is defined as the smallest integer such that:

Now, let’s jump to the first method.

Let’s first define pure, unpadded concatenation mathematically. We define a function on a sequence . The standard concatenation is:

More compactly, using summation notation:

I know it looks dangerous! But the meaning is simply putting the numbers together side-by-side. Suppose , then:

Just that. It was me trying to be mathematical! You don’t actually need to evaluate that horrifying equation by hand.

This pure concatenation is clearly effective when the digit length for every element in the sequence is exactly the same. However, a problem arises when we attempt to decode it. Suppose you are given . To retrieve , you need to divide into perfect blocks of digits. But what must the block size be?

To solve this, we must encode the uniform block size—let’s call it —directly into our resulting integer. Furthermore, we must account for sequences with varying digit lengths by padding them.

Here is the complete logic for our injective function :

First, find the maximum digit length in the sequence, . Pad all numbers in the sequence with leading zeros so that every element has exactly digits. Next, concatenate them. Finally, to encode the block size , pick a digit that is strictly different from the last digit of your concatenated number, and append it exactly times at the end.

Let’s standardize the picking of this digit. Let’s call it the appending digit, denoted by , defined entirely by the very last integer in the sequence, :

Because of this definition, is guaranteed to be different from the last digit of the concatenated block.

Let’s look at . The block size is . The last digit of is . So, . We append exactly times:

How do we get the sequence back? Count the trailing identical digits until the digit changes. There are two s, so our block size is . Discard the s, and divide the remaining number into blocks of .

Now, let’s look at a sequence with varying digits, which forces padding. Suppose .The maximum block size is . Padding the sequence gives us:

Concatenating yields . The last integer ends in , so . We append it times:

Notice that mathematically, the integer loses its leading zeros! Because of this, we must always start dividing into blocks from right to left during the decoding phase. If we attempt to decode (after removing the s) from the left, the missing zeros will cause catastrophic errors. By parsing in blocks of from the right, the structure perfectly reconstructs itself, naturally restoring the implicit leading zeros.

But here is the twist. Try to encode and .

The Mathematical Definition of :

We can actually define this entire padding, concatenation, and appending process in one elegant mathematical formula. By treating each padded number as a “digit" in base-, the implicit leading zeros are handled naturally by the powers of 10. The repeated appending digit can be generated using the repdigit formula .

For a non-empty sequence of length :

(Note: If is the empty sequence, we can simply define to maintain injectivity).

Notice the addition of the term at the beginning of our formula. This acts as a mathematical sentinel value (or anchor). To understand why this is strictly necessary for injectivity, we must remember that pure integers do not retain leading zeros.

Suppose we have two distinct sequences: Both have a maximum digit length of . If we were to encode them without our sentinel value, relying only on concatenation and the appending digit ( for both), the implicit leading zero in would vanish algebraically:

We hit a collision: utterly destroying our injectivity.

By adding , we are effectively placing a permanent exactly one block-width to the left of our sequence. This acts as a physical wall, trapping any leading zeros inside the integer’s magnitude.

During decoding, once we parse the integer right-to-left into blocks of size , the final block will always be exactly . When we hit this sentinel block, we simply discard it and terminate the process, perfectly preserving all zeros that came before it. That’s it. This is our first basic, yet perfectly injective, compression method. Let’s talk about surjectivity little bit. Because not every integer ends in a valid sequence of identical digits that perfectly divides the remaining leading digits into blocks, this function is strictly injective, but clearly not surjective.

Now as being a student of mathematics, can we prove that is injective? The answer is yes. We can. So, let me state this as a theorem.

Proof. By magic. ◻

Modulo-Quotient Decomposition

What do we do in this method? Well before that, we must just skim through the Euclidean Division Lemma. The lemma states that for every positive integer and , there exist unique integers and such that

where . This lemma is a complicated way to say that there exist a unique quotient () and remainder () for every division.

What do we see that’s interesting here? That can uniquely be broken into and . So, we are doing ‘ "decomposition" using remainders. Hence the name of the method.

So how can we use this to encode our sequence into a single integer? Well, taking inspiration from Golomb Coding, just encode the quotient and remainder instead of the elements of the sequences. How exactly?

Choose a prime immediately greater than the mean of sequence .
Get the quotients. Let me mathematically write it. Suppose be the sequence of quotients.
Get the remainder sequence as follows.
Use to encode . That’s it. Basically do the following.

So, let me mathematically write this new function.

In this method, it is not necessary to use prime. It can be anything or any number, even just the floor of the mean. But I am choosing it to be prime because I feel like it and I am very moody.

Let’s try to encode , and will be encoded by angels.

So, . So, the value would be . The prime would be . Now would be and .

Okay, now and , and we already have prime . Because our new sequence is a 3-tuple and has 16 digits, our maximum block size is now . We pad everything to 16 digits, calculate our sentinel value, and concatenate to get :

The method for decomposition is the same. Count appending digits and remove them. Divide into blocks, remove the sentinel value, and get D and R. And do the same for and and use to get the sequence .

Also, is injective but not necessarily surjective, as it is just in some different form. ’

Polynomial Interpolation over a Finite Field

Until now, our methods have been more “algorithmic" than mathematical—we were essentially writing equations for algorithms. We haven’t fully embraced mathematical rigor, but the time has come to do so.

As the name suggests, we are discussing polynomial interpolation. What exactly is that? Fundamentally, it is like drawing a graph. In interpolation, we are given a set of points and must find an equation that satisfies them. In other words, we seek the unique graph passing through those points.

Let us take our sequence . Suppose we want to find a polynomial that satisfies the following points: Mathematical law dictates that for any points, there exists exactly one unique polynomial of degree at most that passes through all of them. This is known as Lagrange Interpolation. The "wild idea" here is this: instead of compressing the sequence directly, we construct this polynomial and simply extract its coefficients!

Let be the sequence of our coefficients:

We now only need to encode . This can be done using any preferred method, such as concatenation or modulo-quotient decomposition. As a tip, try to use the method that results in the fewest digits (e.g., use concatenation when there are no outliers).

However, if you have studied Lagrange Interpolation before, you know that the coefficients usually end up being rational numbers. But our domain is the set —the set of all finite sequences of non-negative integers. We need our coefficients to be non-negative integers as well. This is exactly where the Finite Field (specifically a Galois Field, denoted as or ) enters the picture.

In this method, we utilize the “moody" prime , ensuring that and . Why? Because by performing our interpolation entirely within , fractions become fundamentally impossible. Every coefficient is mathematically guaranteed to be a “clean" integer such that .

The basis for this interpolation modulo is as follows: for each point from to , the basis polynomial is defined as: Our final polynomial is the sum of these bases multiplied by our sequence values:

When operating in , every calculation is performed modulo . Suppose your sequence is and you choose . Because you are in , the becomes . Your sequence transforms into before the polynomial is even built, resulting in a permanent loss of original data and threatening injectivity. Therefore, the prime must be a “ceiling" higher than your largest value to ensure the sequence remains unique.

Let’s use this method to encode . First, we find . We choose (the prime immediately greater than ). For : After calculating the remaining terms, we form the polynomial : Now we simply encode . (Note: coefficients are listed from to ).

Is this method injective? Lagrange interpolation is strictly bijective, but there is one potential problem. Suppose we have . The polynomial is simply , giving . Now consider . Its polynomial is also . If we only look at the coefficients, we hit a collision! How do we know if should be evaluated 3 times or 5 times?

We resolve this by refining our definition:

Coefficient sequence (): The coefficient sequence of the polynomial must have a length strictly equal to the length of .

Under this rule, and . The collision is avoided, and the method remains strictly injective.

Decoding is laughably simple: evaluate the polynomial at the known indices: Since we also need the “moody" prime to retrieve the sequence, the final encoding formula is:

Don’t tell me that domain of above equation is wrong! We append the moody prime to the end of our coefficient sequence , and pass the entire flat sequence into our function.

Proof. Do I need to prove it again? ◻

Compression of Integers.

Assumptions and Terminology

Concatenation

Modulo-Quotient Decomposition

Polynomial Interpolation over a Finite Field

Add a comment