Core ML

Stochastic Cellular Automata for Inference

In this project, we advocate that perhaps the most embarrassingly parallel of all computational models, stochastic cellular automata (SCA), is sufficient for inference in a large class of latent variable models. In SCA, the atomic unit of computation is the cell, and each cell interacts with a set of neighbouring cells via a local update rule. At each time step, every cell in the automaton updates simultaneously, enabling extreme parallelism. Given this computational model, the challenge is to design update rules that correctly implement the desired inference algorithm for our choice of statistical model. Applied to latent Dirichlet allocation, we find that our algorithm is over an order of magnitude faster than the current state of the art. A simple C++/MPI implementation on a 4-node cluster samples 570 million tokens per second. We process 3 billion documents and achieve predictive power competitive with collapsed Gibbs sampling and variational inference. We look forward to extending this approach to more interesting models, such as restricted Boltzmann machines (RBMs) or models combining deep networks and Bayesian models.
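
To make the computational model concrete, here is a minimal sketch (in Python, not our C++/MPI code) of one synchronous step of a toy stochastic cellular automaton on a binary grid: each cell resamples its state from a local rule that depends only on its four neighbours, and all cells update simultaneously in a single vectorised pass. The grid size, neighbourhood, and flip_prob rule are illustrative placeholders, not the update rule we use for latent Dirichlet allocation.

```python
import numpy as np

def sca_step(state, rng, flip_prob=0.1):
    """One synchronous step of a toy stochastic cellular automaton."""
    # Neighbour sum for every cell at once (periodic boundaries via np.roll).
    neighbours = (np.roll(state, 1, axis=0) + np.roll(state, -1, axis=0) +
                  np.roll(state, 1, axis=1) + np.roll(state, -1, axis=1))
    # Toy local rule: lean towards the neighbourhood majority, with noise.
    p_on = (neighbours / 4.0) * (1.0 - flip_prob) + 0.5 * flip_prob
    # Every cell draws its next state from the *old* grid, so all cells
    # fire simultaneously -- the property that makes SCA embarrassingly parallel.
    return (rng.random(state.shape) < p_on).astype(np.int8)

rng = np.random.default_rng(0)
grid = rng.integers(0, 2, size=(64, 64), dtype=np.int8)
for _ in range(100):
    grid = sca_step(grid, rng)
```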

Fast Alias Sampling Technique and Analysis

The objective of this project is to scale Bayesian nonparametric models such as topic models, clustering, and hierarchical representations to big data by overcoming their computational bottlenecks. In this pursuit, we exploit the fact that clusters, topics, trees, or other structures far away from the current set of observations are almost irrelevant, which allows us to ignore them. This lets us cheaply obtain a proposal distribution over cluster membership that does not require comparing against all alternatives. We then use the Metropolis-Hastings (MH) algorithm together with alias techniques for fast sampling, as sketched below. However, a major obstacle in applying such alias techniques to Bayesian nonparametrics is that the number of possible categories is not predetermined and keeps changing during inference. If a category is inserted or deleted, we cannot use the stale distribution as an MH proposal, because the support of the proposal no longer contains the support of the target distribution. Consequently, we cannot directly cache alias tables for fast sampling from the posteriors. To circumvent this problem, we are exploring data structures for storing the alias table that allow cheap updates when a possible outcome is inserted or deleted.
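
For reference, the static building block we rely on is Walker's alias method, which constructs a table in O(K) time and then draws samples in O(1); a (possibly stale) table can serve as an MH proposal as long as its support still covers the target. The sketch below illustrates only that building block and the resulting independence-sampler acceptance step; the names target and q_weights are hypothetical, and the dynamic insert/delete data structure discussed above is not shown.

```python
import random

def build_alias_table(weights):
    """Walker's alias method: O(K) construction, O(1) sampling."""
    K = len(weights)
    total = sum(weights)
    prob = [w * K / total for w in weights]   # scaled so the mean is 1
    alias = [0] * K
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l
        prob[l] -= 1.0 - prob[s]              # move excess mass of l onto s
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """Draw one category in O(1) time."""
    k = rng.randrange(len(prob))
    return k if rng.random() < prob[k] else alias[k]

def mh_step(z, target, q_weights, prob, alias, rng=random):
    """Independence MH step using the (possibly stale) alias table as proposal q.
    Accept a move from z to z' with probability min(1, p(z') q(z) / (p(z) q(z')))."""
    z_new = alias_draw(prob, alias, rng)
    accept = min(1.0, (target(z_new) * q_weights[z]) /
                      (target(z) * q_weights[z_new]))
    return z_new if rng.random() < accept else z
```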

Gaussian LDA for Topic Models with Word Embeddings

Continuous space word embeddings learned from large, unstructured corpora have been shown to be surprisingly effective at capturing semantic regularities in language. While such word embeddings have been incorporated to produce state-of-the-art results in numerous supervised natural language processing tasks, from the word level to the document level, they have played a more minor role in unsupervised learning problems. This work shows some of the promise that they hold in this domain. In particular, we replace LDA’s parameterization of “topics” as categorical distributions over opaque word types with multivariate Gaussian distributions on the embedding space. This encourages the model to group words that are a priori known to be semantically related into topics. To perform inference, we introduce a fast collapsed Gibbs sampling algorithm based on Cholesky decompositions of the covariance matrices of the posterior predictive distributions. Using vectors learned from a domain-general corpus (English Wikipedia), we report results on two document collections (20-newsgroups and NIPS). Qualitatively, Gaussian LDA infers different (but still very sensible) topics relative to standard LDA. Quantitatively, our technique outperforms existing models at dealing with OOV words in held-out documents. More broadly, using the lingua franca of vector space representations makes it easier to run LDA on documents whose modalities go beyond text, so we expect numerous interesting applications in this area.
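
The speed of this sampler comes from the observation that adding or removing a single word vector from a topic changes that topic’s scatter matrix by a rank-one term, so its Cholesky factor can be updated in O(d^2) instead of being recomputed from scratch in O(d^3). Below is a minimal sketch of such a rank-one Cholesky update/downdate; the surrounding Gibbs sweep, the prior bookkeeping, and the predictive density evaluation are omitted.

```python
import numpy as np

def chol_update(L, x, sign=+1.0):
    """Rank-one update of a lower-triangular Cholesky factor.

    Given L with A = L @ L.T, returns the factor of A + sign * x x^T
    in O(d^2).  For sign = -1 (a downdate) the result must remain
    positive definite.
    """
    L = L.copy()
    x = x.astype(float).copy()
    d = len(x)
    for k in range(d):
        r = np.sqrt(L[k, k] ** 2 + sign * x[k] ** 2)
        c, s = r / L[k, k], x[k] / L[k, k]
        L[k, k] = r
        if k + 1 < d:
            L[k + 1:, k] = (L[k + 1:, k] + sign * s * x[k + 1:]) / c
            x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L

# Illustration: updating the factor of a topic's scatter matrix when a
# d-dimensional word vector v is assigned to that topic.
d = 5
rng = np.random.default_rng(1)
A = np.eye(d)
L = np.linalg.cholesky(A)
v = rng.normal(size=d)
L_up = chol_update(L, v, sign=+1.0)          # word vector added to the topic
assert np.allclose(L_up @ L_up.T, A + np.outer(v, v))
```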

Topic Modelling for WWW

The objective of this project is to develop a better generative model for graphs in which both edge and vertex properties are captured. In other words, the topics of a page are a function of both the content of the page and the links it connects to. We also aim to preserve the power-law properties of such graphs, e.g. the vertex degree distribution, triangle counts, etc. Such a model would be useful for detecting whether parts of the graph are spam or ham, for building a good model of relevance, for predicting how the graph will grow, and for identifying the content/parameters of a vertex given the rest. We are working towards a theoretically sound generative graph model that better explains the WWW, along with an efficient inference algorithm that can handle the Amazon public web crawl, which contains about 6 billion pages with links and content.

Spectral Methods for Hierarchical Dirichlet Process

Nonparametric Bayesian mixture models are often used for topic modelling or clustering. A popular tool is the hierarchical Dirichlet process (HDP) mixture model. To estimate its parameters, commonly used algorithms are online variational Bayes or Gibbs sampling in the Chinese restaurant franchise representation. However, these approaches are typically computationally prohibitive. In this paper we introduce a computationally and statistically efficient spectral algorithm for inferring parameters in HDPs using tensor decomposition. We derive the lower-order moments of the HDP parameters and provide theoretical guarantees on recovery accuracy. Moreover, we propose a hybrid method that allows us to combine spectral and likelihood approaches. Finally, applying it to two corpora, namely Enron emails and Wikipedia, we show that the resulting algorithm is both faster than likelihood-based approaches and outperforms them in terms of the predictive power of the trained model.
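
A central subroutine in such spectral estimators is decomposing an empirical third-order moment tensor into rank-one components, for example with the robust tensor power method. The sketch below shows only that generic decomposition step, applied to an already whitened symmetric tensor T; the HDP-specific moment formulas, the whitening step, and the hybrid spectral/likelihood procedure described above are omitted.

```python
import numpy as np

def tensor_power_method(T, k, n_iters=100, n_restarts=10, rng=None):
    """Recover k rank-one components of a whitened symmetric 3rd-order tensor.

    T has shape (d, d, d).  Each component is found by the power update
    v <- T(I, v, v) / ||T(I, v, v)||, keeping the best of several random
    restarts, and is then deflated from the tensor.
    """
    rng = rng or np.random.default_rng(0)
    d = T.shape[0]
    T = T.copy()
    weights, vectors = [], []
    for _ in range(k):
        best_v, best_lam = None, -np.inf
        for _ in range(n_restarts):
            v = rng.normal(size=d)
            v /= np.linalg.norm(v)
            for _ in range(n_iters):
                u = np.einsum('ijk,j,k->i', T, v, v)   # multilinear map T(I, v, v)
                v = u / np.linalg.norm(u)
            lam = np.einsum('ijk,i,j,k->', T, v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        weights.append(best_lam)
        vectors.append(best_v)
        # Deflate: remove the recovered rank-one component before continuing.
        T -= best_lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)
    return np.array(weights), np.array(vectors)
```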