This post contains my solution to an exercise from the Computational Learning Theory course taught by Professor Varun Kanade at Oxford University. The problem reads as follows:
PROBLEM (Learning boolean threshold functions). For $\mathbf{w} \in \{0,1\}^n$ and $\theta \in \mathbb{N}$, the boolean threshold function $f_{\mathbf{w},\theta} : \{0,1\}^n \to \{0,1\}$ is defined by
$$f_{\mathbf{w},\theta}(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} \geq \theta, \\ 0 & \text{otherwise.} \end{cases}$$
Define the concept class of the threshold functions as $\mathsf{TH}_n = \{ f_{\mathbf{w},\theta} : \mathbf{w} \in \{0,1\}^n, \theta \in \mathbb{N} \}$ and $\mathsf{TH} = \bigcup_{n \geq 1} \mathsf{TH}_n$. Show that unless $\mathsf{NP} = \mathsf{RP}$, there is no efficient proper PAC learning algorithm for $\mathsf{TH}$.
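Before diving into the proof, a concrete illustration may help. The following minimal Python sketch (my own illustration, not part of the exercise) evaluates a boolean threshold function with boolean weights:

```python
# A boolean threshold function f_{w,theta}: {0,1}^n -> {0,1} with
# boolean weights w in {0,1}^n: f(x) = 1 iff w . x >= theta.

def threshold_fn(w, theta):
    """Return the boolean threshold function f_{w,theta} as a callable."""
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

# Example: w = (1, 0, 1, 1), theta = 2 fires iff at least two of the
# relevant coordinates x1, x3, x4 are set.
f = threshold_fn((1, 0, 1, 1), 2)
print(f((1, 1, 1, 0)))  # -> 1  (w . x = 2 >= 2)
print(f((1, 1, 0, 0)))  # -> 0  (w . x = 1 < 2)
```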
This problem took me a whole Saturday in the library, and after I came up with my solution, I found that my proof was different from the hint given by Kanade (which is not a bad thing for me :-)).
Let us recall the definition of the PAC model, introduced by Leslie Valiant in his seminal 1984 paper, which was among the key contributions that led to his 2010 Turing Award:
Definition (PAC model). For $n \geq 1$, let $\mathcal{C}_n$ be a concept class and $\mathcal{H}_n$ be a polynomially evaluable hypothesis class over the instance space $X_n$. Let $\mathcal{C} = \bigcup_{n \geq 1} \mathcal{C}_n$, $\mathcal{H} = \bigcup_{n \geq 1} \mathcal{H}_n$, and $X = \bigcup_{n \geq 1} X_n$. For any concept $c \in \mathcal{C}$, let $\mathrm{size}(c)$ be the representation size of $c$. Then we say that $\mathcal{C}$ is PAC learnable using the hypothesis class $\mathcal{H}$ if there exists an algorithm $L$ satisfying the following: for every $n$, for every concept $c \in \mathcal{C}_n$, for every probability distribution $D$ over $X_n$, and for arbitrarily small constants $\epsilon > 0$ and $\delta > 0$, if $L$ is given access to an oracle $\mathrm{EX}(c, D)$ and the inputs $\epsilon$ and $\delta$, then $L$ outputs a hypothesis $h \in \mathcal{H}_n$ such that, with probability at least $1 - \delta$, the prediction error $\Pr_{\mathbf{x} \sim D}[h(\mathbf{x}) \neq c(\mathbf{x})]$ (over the choice of instance $\mathbf{x}$) is at most $\epsilon$, and such that the number of queries that $L$ makes to $\mathrm{EX}(c, D)$ is bounded by a polynomial in $n$, $1/\epsilon$, $1/\delta$, and $\mathrm{size}(c)$.
In the above definition, the probability $1 - \delta$ is over the randomness of the oracle $\mathrm{EX}(c, D)$ as well as any internal randomness of the algorithm $L$. By efficient, we mean that the running time of $L$ is polynomial in $n$, $1/\epsilon$, $1/\delta$, and $\mathrm{size}(c)$; by proper, we require $L$ to output a hypothesis that belongs to $\mathcal{C}$, meaning that the hypothesis class is the same as the target concept class, or in other words, $L$ learns $\mathcal{C}$ using $\mathcal{H} = \mathcal{C}$.
PROOF IDEA: I will now show that there is no efficient algorithm for properly PAC learning $\mathsf{TH}$ unless $\mathsf{NP} = \mathsf{RP}$. The proof idea is similar to that of any other hardness proof: reduce a known $\mathsf{NP}$-complete language to PAC learning $\mathsf{TH}$. In particular, suppose $\Pi$ is a known $\mathsf{NP}$-complete language and we wish to decide whether a given string $z$ is a member of $\Pi$. Given $z$, we need to construct an instance of PAC learning $\mathsf{TH}$ and then show that we can determine the membership of $z$ in $\Pi$ by solving the constructed learning instance. More precisely, we need to construct a labeled training set $T$ and show that there exists a hypothesis in the concept class that is consistent with $T$ if and only if $z \in \Pi$. By consistent, we mean that all predictions induced by the hypothesis match the labels in $T$. In addition, $|T|$ needs to be bounded by a polynomial in the length of $z$ to guarantee that the reduction can be done in polynomial time.
A question might arise at this point: given a PAC learning algorithm for $\mathsf{TH}$, how can we decide $\Pi$ (in polynomial time, with high probability)? This can be done by the following general method. Given a string $z$, suppose that we know how to construct $T$ so that there is a hypothesis consistent with $T$ if and only if $z \in \Pi$. Let $D$ be the uniform distribution over the instances in $T$. Select the target error $\epsilon = 1/(2|T|)$ and any constant confidence parameter $\delta < 1/2$. With $T$ and $D$, we can simulate an oracle $\mathrm{EX}(c, D)$ and use it to learn with error less than $\epsilon$ and confidence $1 - \delta$ in polynomial time (the target concept $c$ here is determined by the labels in $T$). We then simply check whether the output of the learning algorithm is consistent with $T$. If it is, we accept $z$; otherwise, we reject it.
Why does the above simulation successfully decide $\Pi$ (with probability at least $1 - \delta$)?
In the case $z \in \Pi$: by our construction of $T$, there is a concept in the class that is consistent with $T$. By the PAC learning guarantee, the output hypothesis of the learning algorithm has prediction error less than $\epsilon$ with confidence $1 - \delta$. Moreover, by the choice of $\epsilon = 1/(2|T|)$, this output hypothesis must be consistent with $T$: if there were even a single incorrect prediction, the error under the uniform distribution would be at least $1/|T| > \epsilon$, which contradicts the PAC learning guarantee. That is, the learning algorithm outputs a consistent hypothesis with confidence $1 - \delta$.
If $z \notin \Pi$: there is no concept that is consistent with $T$, so the learning algorithm necessarily returns an inconsistent output. Consistency can easily be checked in polynomial time.
Therefore, we can decide $\Pi$ (with probability at least $1 - \delta$) by checking the output of the learning algorithm.
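The general method above can be sketched in Python. This is only an illustration: `build_training_set` and `pac_learner` are hypothetical stand-ins for the reduction and the assumed learner, not real library functions.

```python
# Sketch of the generic decision procedure: build the training set T,
# simulate the example oracle EX(c, D) as uniform draws from T, run the
# (assumed) PAC learner, and accept iff its output is consistent with T.

import random

def decide(instance, build_training_set, pac_learner):
    T = build_training_set(instance)        # list of (x, label) pairs
    eps = 1.0 / (2 * len(T))                # error below 1/|T| forces consistency
    delta = 0.25                            # any constant < 1/2 works

    def ex_oracle():                        # EX(c, D) with D uniform over T
        return random.choice(T)

    h = pac_learner(ex_oracle, eps, delta)  # hypothesis h: x -> {0,1}
    consistent = all(h(x) == y for x, y in T)
    return consistent                       # accept iff h is consistent with T
```

Note that the procedure errs only when the instance is in the language (the learner may fail with probability at most $\delta$); when it is not, rejection is certain.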
PROOF: The most challenging part of any hardness proof is choosing which $\mathsf{NP}$-complete language to reduce from. For this proof, I tried several problems before succeeding with SET-COVER. In fact, in the problem set on the course webpage, Kanade gives a hint to reduce from the Binary Integer Programming problem (which I haven't tried yet).
The SET-COVER problem is defined as follows: given a finite collection $\mathcal{S} = \{S_1, \ldots, S_m\}$ of finite subsets and an integer $k$, is there a subcollection of $\mathcal{S}$ with cardinality at most $k$ whose union equals the union of $\mathcal{S}$? The corresponding language is given by
$$L_{\mathrm{SC}} = \Big\{ \langle \mathcal{S}, k \rangle : \exists I \subseteq [m], |I| \leq k, \textstyle\bigcup_{j \in I} S_j = \bigcup_{j=1}^m S_j \Big\},$$
where $[m] = \{1, \ldots, m\}$ denotes the index set of the subsets in $\mathcal{S}$. SET-COVER is well known to be $\mathsf{NP}$-complete. Let $U = \{u_1, \ldots, u_n\}$ denote the ground set $\bigcup_{j=1}^m S_j$ and $n$ be the total number of elements. Two properties of this problem motivated me to choose it: i) a valid cover consists of at most $k$ subsets, and ii) each element of $U$ is covered by at least one chosen subset. Both are threshold-type conditions; however, to match the condition in the definition of $\mathsf{TH}$, these two "thresholds" must be the same value (namely the $\theta$ in $\mathsf{TH}$). I will show shortly that this hurdle can be overcome with a simple padding trick.
Given $\langle \mathcal{S}, k \rangle$, the training set $T$ for the PAC learning is constructed as follows. $T$ consists of $n+1$ labeled samples in $\{0,1\}^{m+k}$, where the first $n$ samples are labeled 1 and the last sample is labeled 0. Each of the first $n$ samples corresponds to one element of the ground set $U$: for $i = 1, \ldots, n$, the sample $\mathbf{x}^{(i)} \in \{0,1\}^{m+k}$ is given by
$$x^{(i)}_j = \begin{cases} 1 & \text{if } j \leq m \text{ and } u_i \in S_j, \\ 0 & \text{if } j \leq m \text{ and } u_i \notin S_j, \\ 1 & \text{if } m < j \leq m+k. \end{cases}$$
Precisely, the first $m$ entries of the vector indicate which subsets in $\mathcal{S}$ contain the $i$th element of $U$, and the last $k$ entries are padded with 1. The last sample, labeled 0, is given by
$$x^{(n+1)}_j = \begin{cases} 1 & \text{if } j \leq m, \\ 0 & \text{if } m < j \leq m+k. \end{cases}$$
That is, $\mathbf{x}^{(n+1)}$ is a vector with value 1 in the first $m$ entries and padding value 0 in the last $k$ entries. The problem of determining the membership of $\langle \mathcal{S}, k \rangle$ in $L_{\mathrm{SC}}$ is now reduced to checking the existence of a consistent hypothesis by properly PAC learning the concept class
$$\mathcal{C} = \{ f_{\mathbf{w},\,k+1} : \mathbf{w} \in \{0,1\}^{m+k} \} \subseteq \mathsf{TH}_{m+k}.$$
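As a sanity check of the construction, here is a small Python sketch (my own illustration) that builds $T$ from a SET-COVER instance exactly as described, with the element samples padded by 1s and the single negative sample padded by 0s:

```python
# Build the training set T from a SET-COVER instance (S_1, ..., S_m; k).
# Each element u_i of the ground set yields a positive sample whose first
# m entries record which subsets contain u_i, padded with k ones; one
# negative sample has m ones followed by k zeros.

def build_training_set(subsets, k):
    """Return a list of (x, label) pairs with x in {0,1}^(m+k)."""
    m = len(subsets)
    ground = sorted(set().union(*subsets))    # ground set U = union of all S_j
    T = []
    for u in ground:                          # one positive sample per element
        first = [1 if u in S else 0 for S in subsets]
        T.append((tuple(first + [1] * k), 1)) # last k entries padded with 1
    T.append((tuple([1] * m + [0] * k), 0))   # the single negative sample
    return T
```

For instance, the instance $\mathcal{S} = \{\{1,2\}, \{2,3\}\}$, $k = 1$ yields the four samples $(1,0,1)$, $(1,1,1)$, $(0,1,1)$ labeled 1 and $(1,1,0)$ labeled 0.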
What remains now is to prove the correctness of the reduction, that is, to show that $\langle \mathcal{S}, k \rangle \in L_{\mathrm{SC}}$ if and only if some hypothesis in $\mathcal{C}$ is consistent with $T$:
First, suppose the ground set $U$ can be covered by no more than $k$ subsets, and let $I \subseteq [m]$ with $|I| \leq k$ be the index set of this cover. We define the vector $\mathbf{w} \in \{0,1\}^{m+k}$ as follows:
$$w_j = \begin{cases} 1 & \text{if } j \leq m \text{ and } j \in I, \\ 0 & \text{if } j \leq m \text{ and } j \notin I, \\ 1 & \text{if } m < j \leq m+k. \end{cases}$$
Then the hypothesis $f_{\mathbf{w},k+1}$ is in fact consistent with $T$. Specifically, consider the sample $\mathbf{x}^{(i)}$ corresponding to the $i$th element of $U$: since every element is covered by at least one subset indexed by $I$, we have $\sum_{j=1}^m w_j x^{(i)}_j \geq 1$. Moreover, since the last $k$ entries of $\mathbf{x}^{(i)}$ and those of $\mathbf{w}$ are all 1, we have $\sum_{j=m+1}^{m+k} w_j x^{(i)}_j = k$. This implies that $\mathbf{w} \cdot \mathbf{x}^{(i)} \geq k+1$, so $f_{\mathbf{w},k+1}(\mathbf{x}^{(i)}) = 1$, which is consistent with the label of $\mathbf{x}^{(i)}$. For the last sample, we have that $\mathbf{w} \cdot \mathbf{x}^{(n+1)} = |I| \leq k < k+1$, which is also consistent with the label 0. Here the inequality follows from the fact that there are at most $k$ positions with value 1 in the first $m$ entries of $\mathbf{w}$ and that the last $k$ entries of $\mathbf{x}^{(n+1)}$ are all zero by the definition given above.
For the other direction, suppose that the hypothesis $f_{\mathbf{w},k+1} \in \mathcal{C}$ is consistent with $T$; the cover of $U$ is simply defined by the index set $I = \{ j \leq m : w_j = 1 \}$. It is straightforward to verify that this is a valid cover. Specifically, by the consistency on the last sample, we have that $\mathbf{w} \cdot \mathbf{x}^{(n+1)} \leq k$. By the definition of $\mathbf{x}^{(n+1)}$, it must be that $\sum_{j=1}^m w_j \leq k$, which implies $|I| \leq k$. For the $i$th element of the ground set, also by the consistency, we have that $\mathbf{w} \cdot \mathbf{x}^{(i)} \geq k+1$. Since the padding entries contribute at most $k$ to this inner product, it follows that $\sum_{j \leq m : u_i \in S_j} w_j \geq 1$, so there is at least one chosen subset that covers this element, by the definition of $\mathbf{x}^{(i)}$ given above.
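Both directions can be checked by brute force on tiny instances. The sketch below (my own illustration, exponential time, for sanity checking only) tests whether some $f_{\mathbf{w},k+1}$ with $\mathbf{w} \in \{0,1\}^{m+k}$ is consistent with the constructed samples, and compares this against a direct search for a cover of size at most $k$:

```python
# Brute-force check of the reduction's correctness on small instances:
# a consistent f_{w, k+1} with w in {0,1}^(m+k) exists iff the SET-COVER
# instance has a cover of size at most k.

from itertools import combinations, product

def has_cover(subsets, k):
    """Direct search: does some subcollection of size <= k cover the ground set?"""
    ground = set().union(*subsets)
    return any(set().union(*combo) == ground
               for r in range(1, k + 1)
               for combo in combinations(subsets, r))

def has_consistent_hypothesis(T, k):
    """Search all boolean weight vectors w for a consistent f_{w, k+1}."""
    dim = len(T[0][0])
    for w in product((0, 1), repeat=dim):
        if all((sum(wi * xi for wi, xi in zip(w, x)) >= k + 1) == (y == 1)
               for x, y in T):
            return True
    return False
```

On the instance $\mathcal{S} = \{\{1,2\}, \{2,3\}\}$, $k = 1$ (no cover of size 1), no consistent $f_{\mathbf{w},2}$ exists for its samples, while instances that do admit a small cover also admit a consistent hypothesis, matching both directions of the argument.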
Now we know how to reduce SET-COVER to PAC learning $\mathsf{TH}$. If there were an efficient proper PAC learning algorithm for $\mathsf{TH}$, then we would have a randomized algorithm that decides an $\mathsf{NP}$-complete language in polynomial time with error probability less than $1/2$; moreover, the error is one-sided, since the procedure never errs when $\langle \mathcal{S}, k \rangle \notin L_{\mathrm{SC}}$, which is exactly what $\mathsf{RP}$ requires. This would imply that $\mathsf{NP} = \mathsf{RP}$ and, since it is widely believed that $\mathsf{NP} \neq \mathsf{RP}$, it must be the case that such a learning algorithm, if it exists, is inefficient. This completes the proof.