See the full description of the workshop on the official event website mlc.combgeo.org.

## Materials

**Section 1**

There is a growing body of results in Ramsey theory which provide better bounds or stronger conclusions under the additional assumption of bounded VC-dimension. After giving a short survey of results of this type, we prove the following theorem, which settles a special case of the famous Schur–Erdős problem. There exists a constant $c=c(d)$ such that every $m$-coloring of the edges of a complete graph on at least $2^{cm}$ vertices, in which the family of the neighborhoods of the vertices in each color has VC-dimension at most $d$, contains a monochromatic triangle. This result is best possible up to the value of $c$.
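To make the bounded VC-dimension hypothesis concrete, here is a brute-force check of the VC-dimension of a finite set system (such as the family of color-class neighborhoods above); this is a naive sketch for small instances only, not an algorithm from the talk:

```python
from itertools import combinations

def vc_dimension(universe, sets):
    """Brute-force VC-dimension of a finite set system.

    `sets` is a list of subsets (Python sets) of `universe`. A subset S of
    the universe is shattered if every subset of S arises as S ∩ C for some
    C in `sets`. Exponential time; only meant for tiny examples.
    """
    def shattered(S):
        traces = {frozenset(S & C) for C in sets}
        return len(traces) == 2 ** len(S)

    best = 0
    for k in range(1, len(universe) + 1):
        if any(shattered(set(S)) for S in combinations(universe, k)):
            best = k
        else:
            break  # a subset of a shattered set is shattered, so stop here
    return best
```

For example, the system of all intervals on four points has VC-dimension 2: two points are shattered, but for three points $a<b<c$ no interval produces the trace $\{a,c\}$.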

Streaming algorithms are traditionally divided into two categories: deterministic algorithms and randomized ones. These two types of algorithms exhibit a tradeoff between robustness and efficiency: on one hand, deterministic algorithms always return a correct output, but for many important streaming problems, efficient deterministic algorithms do not exist. On the other hand, efficient randomized algorithms are known for a very wide range of problems, but their correctness typically relies on the assumption that the stream is fixed in advance, which is commonly unrealistic in practice.

In this talk, I will focus on a middle ground that combines both robustness and efficiency: adversarially robust algorithms, whose output is correct with high probability even when the stream elements are adaptively chosen as a function of previous outputs. This regime has received surprisingly little attention until recently, and many intriguing problems are still open. In particular, I shall discuss the robustness of random sampling, mention its close connections to combinatorial online learning, and if time permits, present generic techniques to transform “standard” randomized streaming algorithms for various problems into robust ones.
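One generic transformation of the kind mentioned above, often called "sketch switching", can be caricatured as follows: run several independent copies of the estimator and reveal a new answer only when the tracked value drifts, so the adversary never learns the randomness of unexposed copies. The sketch below is a toy version for a monotone counting estimator; the class names (`SketchSwitcher`, `ExactCounter`) and the exact switching rule are illustrative assumptions, not the construction from the talk:

```python
class ExactCounter:
    """Trivial deterministic estimator used as a stand-in for a randomized sketch."""
    def __init__(self):
        self.n = 0

    def update(self, item):
        self.n += 1

    def query(self):
        return self.n


class SketchSwitcher:
    """Toy 'sketch switching': keep several independent copies of an estimator,
    publish the output of the active copy, and advance to a fresh copy only
    when the estimate grows by a (1 + eps) factor. Between switches the
    published answer is frozen, so an adaptive adversary learns nothing about
    copies it has never seen."""
    def __init__(self, make_estimator, copies, eps):
        self.copies = [make_estimator() for _ in range(copies)]
        self.active = 0
        self.eps = eps
        self.published = None

    def process(self, item):
        for est in self.copies:
            est.update(item)              # every copy sees the whole stream
        current = self.copies[self.active].query()
        if self.published is None:
            self.published = current
        elif current > (1 + self.eps) * self.published \
                and self.active + 1 < len(self.copies):
            self.active += 1              # switch to an unexposed copy
            self.published = self.copies[self.active].query()
        return self.published
```

With eps = 0.5 the published estimate stays within a factor 1.5 of the true count while changing only O(log n) times over a stream of length n.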

Let $P$ be a set of $n$ points and $H$ a set of $m$ halfspaces in $d$ dimensions. We denote by $ch(P)$ the polytope obtained by taking the convex hull of $P$, and by $fh(H)$ the polytope obtained by taking the intersection of the halfspaces in $H$. Our goal is to decide whether $ch(P)$ and $fh(H)$ are disjoint; we call this problem ACIT. Even though ACIT is a natural variant of classic LP-type problems that have been studied at length in the literature, and despite its applications in the analysis of high-dimensional data sets, it appears that the problem has not been studied before.

We discuss how known approaches can be used to attack the ACIT problem, and we provide a very simple strategy that leads to a deterministic algorithm, linear in $n$ and $m$, whose running time depends reasonably on the dimension $d$.
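For intuition about the problem statement, here is the trivial $d=1$ case, where $ch(P)$ is an interval and $fh(H)$ is an intersection of rays; the general algorithm discussed above is of course far more involved:

```python
def acit_1d(points, halfspaces):
    """Decide ACIT in dimension d = 1.

    `points` is a list of reals; ch(points) is the interval [min, max].
    `halfspaces` is a list of pairs (a, b), each encoding {x : a*x <= b}
    with a != 0, so fh(halfspaces) is an interval (possibly empty).
    Returns True iff ch(P) and fh(H) are disjoint.
    """
    lo, hi = min(points), max(points)            # ch(P)
    flo, fhi = float("-inf"), float("inf")       # fh(H)
    for a, b in halfspaces:
        if a > 0:
            fhi = min(fhi, b / a)                # constraint x <= b/a
        else:
            flo = max(flo, b / a)                # constraint x >= b/a
    # disjoint iff fh(H) is empty or one interval ends before the other begins
    return flo > fhi or hi < flo or fhi < lo
```

In one dimension the check is a constant number of comparisons after a linear scan; in higher dimensions the disjointness question becomes a genuine LP-type problem.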

**Section 2**

Comparisons are a classical and well studied algorithmic tool that is used in a variety of contexts and applications. We will discuss two manifestations of the expressiveness of these queries in machine learning and algorithms (a more detailed overview is given below). Both manifestations are based on the notion of “inference dimension” which can be viewed as another instance of the fruitful link between machine learning and discrete mathematics – a link dating back to the discovery of the VC dimension.

- Active classification with comparison queries [Kane, Lovett, Moran, Zhang, FOCS ’17] — Active learning is a model for semi-supervised learning that captures situations in which unlabeled data is abundant and manually labeling it is expensive. We consider an extension of active learning in which the learning algorithm may ask the annotator to compare the distances of two examples from the boundary of their label-class. For example, in a recommendation system (say for restaurants), the annotator may be asked whether she liked or disliked a specific restaurant (a label query), or which of two restaurants she liked more (a comparison query). We prove that the use of comparison queries leads to an exponential improvement in the query complexity of some well-studied problems. Specifically, for the class of half-spaces, we show that under natural assumptions, such as large margin or bounded bit-description of the input examples, it is possible to reveal all the labels of a sample of size $n$ using approximately $O(\log n)$ queries.
- Nearly optimal linear decision trees for $k$-SUM and related problems [Kane, Lovett, Moran, JACM ’19] — We use the above framework to construct linear decision trees for a variety of decision problems in combinatorics and discrete geometry. For example, for any $k$, we construct linear decision trees that solve the $k$-SUM problem on $n$ elements using $O(nk \log n)$ linear queries. Moreover, the queries we use are comparison queries, which compare the sums of two $k$-subsets; when viewed as linear queries, comparison queries are $2k$-sparse and have only $\{-1,+1\}$ coefficients. We give similar constructions for sorting sumsets $A+B$ and for solving the SUBSET-SUM problem, both with an optimal number of queries, up to poly-logarithmic terms.
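The claim that a comparison query is a $2k$-sparse linear query with coefficients in $\{-1,+1\}$ on its support can be checked directly; a small illustrative sketch (the function names are ours, not the paper's):

```python
def comparison_as_linear_query(n, S, T):
    """Encode the comparison 'sum of x over S vs sum of x over T', where S
    and T are k-subsets of {0,...,n-1}, as a linear query: a coefficient
    vector c with entries in {-1, 0, +1} and at most 2k nonzeros, so that
    sign(c . x) = sign(sum_{i in S} x_i - sum_{j in T} x_j).
    Indices lying in both S and T cancel, which only reduces sparsity.
    """
    c = [0] * n
    for i in S:
        c[i] += 1
    for j in T:
        c[j] -= 1
    return c

def evaluate(c, x):
    """The linear query c . x whose sign answers the comparison."""
    return sum(ci * xi for ci, xi in zip(c, x))
```

For instance, comparing the 3-subsets $\{0,1,2\}$ and $\{2,3,4\}$ of six elements yields the vector $(1,1,0,-1,-1,0)$, with at most $2k = 6$ nonzero entries.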

The point location problem is a classical problem in discrete and computational geometry. Given a known partition of $\mathbb{R}^d$ by $n$ hyperplanes into cells, and an unknown point, find as efficiently as possible the cell in which the point lies. Access to the unknown point is indirect, via linear queries with respect to hyperplanes.

This problem is equivalent to another well-studied problem, this time in machine learning: active learning of linear classifiers. Here, there are $n$ known points in $\mathbb{R}^d$, labeled by an unknown hyperplane. The goal is to find, as efficiently as possible, the labels of all $n$ points. Here too, access to the unknown hyperplane is indirect, in the form of label queries.

For both problems, there is a simple information-theoretic lower bound of $\Omega(d \log n)$ on the number of queries needed. Despite 40 years of work, the best known upper bounds had a polynomial dependence on the dimension $d$. In this work, we achieve for the first time a near-linear dependence on the dimension. The proof combines tools from geometry, analysis and machine learning.

The paper will appear in FOCS 2020. arXiv: arxiv.org/abs/2004.11380
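As a toy illustration of label queries, here is the $d=1$ case of active learning of linear classifiers: binary search recovers all $n$ labels of points on a line with $O(\log n)$ queries. The high-dimensional case, which is the subject of the paper, is much harder:

```python
def label_all(points, query):
    """Active learning of a 1-D threshold classifier (a 'halfspace on the line').

    `points` are sorted reals; the unknown classifier labels x as 1 if x >= t
    and 0 otherwise, for some unknown threshold t. `query(i)` returns the
    label of points[i]. Binary search locates the first positive point with
    O(log n) label queries, after which every label is inferred for free.
    """
    lo, hi = 0, len(points)
    while lo < hi:
        mid = (lo + hi) // 2
        if query(mid) == 1:
            hi = mid          # the boundary is at mid or to its left
        else:
            lo = mid + 1      # the boundary is strictly to the right
    return [0] * lo + [1] * (len(points) - lo)
```

On 16 sorted points this issues at most 5 label queries yet outputs all 16 labels, matching the information-theoretic $\Omega(d \log n)$ bound at $d=1$.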

**Section 3**

We show that in supervised learning there are only three possible learning rates: exponential, linear, and arbitrarily slow.

A basic question in learning theory is to identify whether two distributions are identical when we have access only to examples sampled from them. This basic task arises in the context of learning, but also in the context of Generative Adversarial Networks (GANs), where a discriminator is trained to distinguish between a real-life distribution and a synthetic one. Classically, we use a hypothesis class $H$ and declare the two distributions distinct if for some hypothesis in $H$ the expected value on the two distributions is *significantly* different. Our starting point is the following fundamental question: is it beneficial to let the hypothesis depend on more than a single random example? To address this challenge we introduce $k$-ary discriminators, which can be modeled by a family of (hyper)graphs. Each hypergraph is used to test whether the distributions are distinct by estimating the probability of a (hyper)edge. We study the expressiveness of such graph-based discriminators and compare them to the classical setting of learning, which corresponds to $k=1$. We show a separation in the expressiveness of $(k+1)$-ary vs $k$-ary graph-based discriminators, introduce a combinatorial measure called the graph VC dimension, and show that it controls the sample complexity.
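To illustrate the notion of a $k$-ary discriminator for $k=2$: an edge-based discriminator estimates the probability that two independent samples satisfy a relation, e.g. the collision probability. The snippet below computes such statistics exactly for finite distributions; it is a toy illustration of the definition, not the separation construction from the talk:

```python
def singleton_statistic(h, dist):
    """The classical k = 1 statistic: the expected value of a hypothesis h
    under a finite distribution given as {outcome: probability}."""
    return sum(p * h(x) for x, p in dist.items())

def edge_statistic(E, dist):
    """A 2-ary (edge-based) discriminator statistic: the probability that two
    independent samples x, y drawn from dist satisfy the relation E(x, y)."""
    return sum(px * py * E(x, y)
               for x, px in dist.items()
               for y, py in dist.items())
```

For example, the collision relation $E(x,y)=[x=y]$ gives collision probability $\sum_i p_i^2$: 0.25 for the uniform distribution on four symbols versus 0.34 for the skewed distribution $(0.4, 0.4, 0.1, 0.1)$.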

We discuss recent progress in the questions on the VC-dimension of $d$-dimensional polytopes with $k$ facets / $k$ vertices. For the former class, we determine the order of growth of the VC-dimension, which turns out to be superlinear in $k$. For the latter class, we show the first polynomial upper bound on the VC-dimension.

The classical PAC sample complexity bounds are stated for any Empirical Risk Minimizer (ERM) and contain an extra logarithmic factor $\log(1/\epsilon)$, which is known to be necessary for ERM in general. It has recently been shown by Hanneke (2016) that the optimal sample complexity of PAC learning for any VC class $C$ does not include this log factor and is achieved by a particular improper learning algorithm, which outputs a specific majority-vote of hypotheses in $C$. This leaves open the question of when this bound can be achieved by proper learning algorithms, which are restricted to always output a hypothesis from $C$.

In this work we aim to characterize the classes for which the optimal sample complexity can be achieved by a proper learning algorithm. We show that these classes can be characterized by the dual Helly number, a combinatorial parameter that arises in discrete geometry and abstract convexity. In particular, under general conditions on $C$, we show that the dual Helly number is bounded if and only if there is a proper learner that obtains the optimal dependence on $\epsilon$.

As a further implication of our techniques, we resolve a long-standing open problem posed by Vapnik and Chervonenkis (1974) on the performance of the Support Vector Machine by proving that the sample complexity of SVM in the realizable case is $\Theta((n/\epsilon)+(1/\epsilon)\log(1/\delta))$, where $n$ is the dimension. This gives the first optimal PAC bound for halfspaces achieved by a proper learning algorithm, which is moreover computationally efficient.

We use the Littlestone dimension coupled with the consistency dimension to characterize efficient learning by equivalence and membership queries. We also briefly describe how these combinatorial measures connect machine learning with model theory, where many examples, for query learning and other kinds of learning, can be found.
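As a toy instance of learning with membership queries: a monotone conjunction over $n$ bits can be learned exactly with $n$ membership queries. This only illustrates the query model; the characterization in the talk concerns far richer classes and also uses equivalence queries:

```python
def learn_conjunction(n, membership):
    """Exact learning of a monotone conjunction AND_{i in S} x_i over {0,1}^n
    using n membership queries.

    Query the all-ones vector with bit i turned off: the answer is 0 exactly
    when i belongs to the target set S, since turning off a relevant bit is
    the only way to falsify the conjunction from the all-ones point.
    """
    S = set()
    for i in range(n):
        x = [1] * n
        x[i] = 0
        if membership(x) == 0:
            S.add(i)
    return S
```

For the target conjunction $x_1 \wedge x_3$ over five bits, the learner recovers $S = \{1, 3\}$ with exactly five membership queries.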

We survey a variety of tools and results on extremal problems for graphs of bounded VC-dimension.