Congratulations 2019 Faculty and Research Engagement Program (FREP) Recipients!

Yahoo Research is excited to announce the 2019 Faculty and Research Engagement Program (FREP) recipients. This year, we received 100+ proposals from a variety of prestigious institutions around the world. The competition was intense, the review process was difficult, and making the final decisions wasn’t easy. The grants will support professors and students who explore a diverse set of fields, including machine learning, distributed systems, online security, content understanding and recommendation, and images and video understanding.

FREP awards grants to faculty members in support of research to enhance people’s lives by improving the internet. FREP was founded in 2012 to foster cutting-edge collaborations between scientists in academic settings and those at Yahoo Research. We look forward to the insights, scientific advances, and relationships that will grow from FREP over the coming year and for many years to come!

Congratulations to these very impressive researchers:

  • Acceleration for Data Science and Machine Learning
  • Scalable Online Detection of Complex Patterns in Rapid Event Streams
  • Optimal-Transport Bayesian Sampling with Applications to Repulsive Attentions in NLP
  • Interactive learning from weak annotations
  • Deep Learning for Analyzing Ultrasound Movie Images
  • Detecting Intrinsic Visual Privacy Threats
  • PASTE: PArallel Synthesis, Training and Enhancement via Distributionally Robust Optimization and Optimal Transport
  • Representation Learning for Product Graphs
  • Large-scale multi-objective sequential decision making
  • Large-Scale Graph Embeddings
  • Communication-Efficient Federated Learning
  • Adversarial Reformulation-Aware Query Suggestion with Graph Convolutional Networks
  • Modeling Temporal Dynamics of User Behavior for Improved Advertising

Yahoo Research Wins Runner-Up Best Paper Award for “Time-Aware Prospective Modeling of Users for Online Display Advertising” at AdKDD

By Kim Capps-Tanaka, Chief of Staff, Yahoo Research

KDD 2019 in Anchorage, Alaska, has been fantastic so far and yesterday was especially exciting as we won AdKDD’s Runner-Up Best Paper Award for “Time-Aware Prospective Modeling of Users for Online Display Advertising”.

Congratulations to Djordje Gligorijevic (Research Scientist), Jelena Gligorijevic (Research Scientist) and Aaron Flores (Sr. Director)! 

If you’re at KDD, we’d love to chat with you! Stop by booth #39 or any of the poster sessions below:

  • “Predicting Different Type of Conversions using Multi-Task Learning”, Junwei Pan, Yizhi Mao, Alfonso Ruiz, Yu Sun, Aaron Flores
    • Tues, 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall
  • “Carousel Ads Optimization in Yahoo Gemini Native”, Oren Somekh, Michal Aharon, Avi Shahar, Assaf Singer, Boris Trayvas, Hadas Vogel, Dobri Dobrev
    • Tues, 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall
  • “Understanding Consumer Journey using Attention-based Recurrent Neural Networks”, Yichao Zhou, Shaunak Mishra, Jelena Gligorijevic, Tarun Bhatia, Narayan Bhamidipati
    • Tues, 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall
  • “Recurrent Neural Networks for Stochastic Control in Real-Time Bidding”, Nicolas Grislain, Nicolas Perrin, Antoine Thabault
    • Tues, 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall

* Bold authors denotes Yahoo Researchers

Thanks!

Meet Yahoo Research at KDD 2019

By Kim Capps-Tanaka, Chief of Staff, Yahoo Research

If you’re attending KDD in Anchorage, Alaska, the Yahoo Research team would love to meet you! Send us an email or tweet to discuss research or job opportunities on the team.

In addition to hosting a booth, we’re excited to present papers, posters, and talks. 

Sunday, August 4th

“Modeling and Applications for Temporal Point Processes”, Junchi Yan, Hongteng Xu, Liangda Li

  •  8am - 12pm, Summit 8-Ground Level, Egan

Monday, August 5th

“Time-Aware Prospective Modeling of Users for Online Display Advertising”, Djordje Gligorijevic, Jelena Gligorijevic, Aaron Flores

  • 8:40am - 9am, Kahtnu 2 - Level 2, Dena’ina

“The Future of Ads”, Brendan Kitts

  • 3pm-3:30pm, Kahtnu 2 - Level 2, Dena’ina

“Learning from Multi-User Activity Trails for B2B Ad Targeting”, Shaunak Mishra, Jelena Gligorijevic, Narayan Bhamidipati

  • 4:35pm-4:55pm, Kahtnu 2- Level 2, Dena’ina

“Automatic Feature Engineering From Very High Dimensional Event Logs Using Deep Neural Networks”, Kai Hu, Joey Wang, Yong Liu, Datong Chen

  • 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall

Tuesday, August 6th

“Predicting Different Type of Conversions using Multi-Task Learning”, Junwei Pan, Yizhi Mao, Alfonso Ruiz, Yu Sun, Aaron Flores

  • 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall

“Carousel Ads Optimization in Yahoo Gemini Native”, Oren Somekh, Michal Aharon, Avi Shahar, Assaf Singer, Boris Trayvas, Hadas Vogel, Dobri Dobrev

  • 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall

“Understanding Consumer Journey using Attention-based Recurrent Neural Networks”, Yichao Zhou, Shaunak Mishra, Jelena Gligorijevic, Tarun Bhatia, Narayan Bhamidipati

  • 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall

“Recurrent Neural Networks for Stochastic Control in Real-Time Bidding”, Nicolas Grislain, Nicolas Perrin, Antoine Thabault

  • 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall

* Bold authors denotes Yahoo Researchers

Hope to see you at KDD!

Introducing the Yahoo News Ranked Multi-label Corpus, a Novel Dataset to Improve Multilabel Learning


By Akshay Soni, Aasish Pappu

Most content-based websites, like Yahoo News, HuffPost, or any given news site, organize their stories according to subject matter or in some similar way. You can imagine that websites with a huge amount of stories must need an automated method to filter or categorize them as the content is ingested into their systems. For example, algorithms that power Yahoo News label news articles with tags (e.g., Military conflict, Nuclear policy, Refugees) as they are ingested, and then display the content by subject matter and/or on a personalized feed. This well-known process of labeling content with all its relevant tags is known as Multilabel Learning (MLL).

image
An overview of a MLL system in action: as the news articles are ingested, MLL tags them with all the relevant labels.

Up to now, whenever scientists and engineers use MLL to create their own specific models to label content however they like, they have used datasets that have pre-computed features like bag-of-words, or dense representations like doc2vec. In the process of writing our recent ACL 2017 publication “DocTag2Vec: An embedding based Multilabel Learning approach for Document Tagging,” presented in the Rep4NLP workshop, we compiled a novel dataset with raw news stories that allows for researchers to be able to construct their own features that are best-suited for their MLL algorithms. As part of our Webscope data-sharing program, we are making our dataset, called Yahoo News Ranked Multi-label Corpus (YNMLC), available to academics to further advance MLL research.

While traditional MLL approaches rely on given features, DocTag2Vec operates on raw text and automatically learns the best features of that text by embedding both documents and the tags in the same vector space. Inference is then done via a simple nearest-neighbor based approach. DocTag2Vec relies on training data that is composed of the raw text of every document and the labels associated with them.

image
DocTag2Vec embeds documents and the labels associated with them in a common vector-space. This allows inference by a nearest-neighbor approach.

There are many standard datasets available for MLL, but all of them directly provide features and not the actual text of the documents. This allows researchers to work on new algorithms that directly use the provided features but without improving the features themselves. Our YNMLC corpus provides raw text so that researchers can extract their own features that are best for their algorithms. Apart from that, to the best of our knowledge, our corpus is the only one that provides a ranking of the labels for each document in terms of its importance. YNMLC is one of the few large-scale, expertly manually-labeled (by Yahoo News editors) datasets addressing the task of MLL. The corpus contains 48,968 articles that are tagged by any subset of 413 labels. These tags correspond to Vibes (akin to topics) in the Yahoo Newsroom app.

MLL is an area of research that we have applied to labeling news stories. MLL can also be used to label music, videos, blog posts, and virtually any other type of online content. We look forward to seeing the innovative ways in which YNMLC will be used to develop new approaches.

Please cite the following paper if you are using this dataset for academic purposes:

“DocTag2Vec: An embedding based Multilabel Learning approach for Document Tagging”, The 2nd Workshop on Representation Learning for NLP, 2017. Sheng Chen, Akshay Soni, Aasish Pappu, and Yashar Mehdad.

HBase Goes Fast and Lean with the Accordion Algorithm

By Edward Bortnikov, Anastasia Braginsky, and Eshcar Hillel

image

Modern products powered by NoSQL key-value (KV-)storage technologies exhibit ever-increasing performance expectations. Ideally, NoSQL applications would like to enjoy the speed of in-memory databases without giving up on reliable persistent storage guarantees. Our Scalable Systems research team has implemented a new algorithm named Accordion, that takes a significant step toward this goal, into the forthcoming release of Apache HBase 2.0.

HBase, a distributed KV-store for Hadoop, is used by many companies every day to scale products seamlessly with huge volumes of data and deliver real-time performance. At Yahoo, HBase powers a variety of products, including Yahoo Mail, Yahoo Search, Flurry Analytics, and more. Accordion is a complete re-write of core parts of the HBase server technology, named RegionServer. It improves the server scalability via a better use of RAM. Namely, it accommodates more data in memory and writes to disk less frequently. This manifests in a number of desirable phenomena. First, HBase’s disk occupancy and write amplification are reduced. Second, more reads and writes get served from RAM, and less are stalled by disk I/O. Traditionally, these different metrics were considered at odds, and tuned at each other’s expense. With Accordion, they all get improved simultaneously.

We stress-tested Accordion-enabled HBase under a variety of workloads. Our experiments exercised different blends of reads and writes, as well as different key distributions (heavy-tailed versus uniform). We witnessed performance improvements across the board. Namely, we saw write throughput increases of 20% to 40% (depending on the workload), tail read latency reductions of up to 10%, disk write reductions of up to 30%, and also some modest Java garbage collection overhead reduction. The figures below further zoom into Accordion’s performance gains, compared to the legacy algorithm.

image
Figure 1. Accordion’s write throughput compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. Zipf (heavy-tailed) and Uniform primary key distributions.

image
Figure 2. Accordion’s read latency quantiles compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. Zipf key distribution.

image
Figure 3. Accordion’s disk I/O compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. Zipf key distribution.

Accordion is inspired by the Log-Structured-Merge (LSM) tree design pattern that governs HBase storage organization. An HBase region is stored as a sequence of searchable key-value maps. The topmost is a mutable in-memory store, called MemStore, which absorbs the recent write (put) operations. The rest are immutable HDFS files, called HFiles. Once a MemStore overflows, it is flushed to disk, creating a new HFile. HBase adopts multi-versioned concurrency control – that is, MemStore stores all data modifications as separate versions. Multiple versions of one key may therefore reside in MemStore and the HFile tier. A read (get) operation, which retrieves the value by key, scans the HFile data in BlockCache, seeking the latest version. To reduce the number of disk accesses, HFiles are merged in the background. This process, called compaction, removes the redundant cells and creates larger files.

LSM trees deliver superior write performance by transforming random application-level I/O to sequential disk I/O. However, their traditional design makes no attempt to compact the in-memory data. This stems from historical reasons: LSM trees were designed in the age when RAM was in very short supply, and therefore the MemStore capacity was small. With recent changes in the hardware landscape, the overall MemStore size managed by RegionServer can be multiple gigabytes, leaving a lot of headroom for optimization. 

Accordion reapplies the LSM principle to MemStore in order to eliminate redundancies and other overhead while the data is still in RAM. The MemStore memory image is therefore “breathing” (periodically expanding and contracting), similarly to how an accordion bellows. This work pattern decreases the frequency of flushes to HDFS, thereby reducing the write amplification and the overall disk footprint. 

With fewer flushes, the write operations are stalled less frequently as the MemStore overflows, and as a result, the write performance is improved. Less data on disk also implies less pressure on the block cache, higher hit rates, and eventually better read response times. Finally, having fewer disk writes also means having less compaction happening in the background, i.e., fewer cycles are stolen from productive (read and write) work. All in all, the effect of in-memory compaction can be thought of as a catalyst that enables the system to move faster as a whole. 

Accordion currently provides two levels of in-memory compaction: basic and eager. The former applies generic optimizations that are good for all data update patterns. The latter is most useful for applications with high data churn, like producer-consumer queues, shopping carts, shared counters, etc. All these use cases feature frequent updates of the same keys, which generate multiple redundant versions that the algorithm takes advantage of to provide more value. Future implementations may tune the optimal compaction policy automatically. 

Accordion replaces the default MemStore implementation in the production HBase code. Contributing its code to production HBase could not have happened without intensive work with the open source Hadoop community, with contributors stretched across companies, countries, and continents. The project took almost two years to complete, from inception to delivery. 

Accordion will become generally available in the upcoming HBase 2.0 release. We can’t wait to see it power existing and future products at Yahoo and elsewhere.

Researching the Definition of Good Online Conversations and How They Should Rank with the Yahoo News Annotated Comments Corpus

By Courtney Napoles, Aasish Pappu, Joel Tetreault

Comment threads following online news articles often range from vacuous to hateful. That said, good conversations do occur online, with people expressing different viewpoints and attempting to inform, convince, or better understand the other side, even if they can get lost among the multitude of unconstructive comments. At Yahoo Research, we show in recent statistical experiments that automatically identifying and ranking good conversations on top will cultivate a more civil and constructive atmosphere in online communities and potentially encourage participation from more users [1].

In an effort to foster more respectful online discussions and encourage more research among academics surrounding comments, we present the Yahoo News Annotated Comments Corpus (YNACC) via our data sharing program, Webscope. The corpus contains 522K comments from 140K comment threads posted in response to online news articles, and contains manual annotations for a subset of 2.4K comment threads and 9.2K comments. The annotations include 6 attributes of individual comments: sentiment, tone, agreement with other commenters, topic of the comment, intended audience, and persuasiveness. The annotations also include 3 attributes of threads: constructiveness, agreeability within the conversation, and type of conversation, i.e., flamewars vs positive/respectful [2].

Annotated conversations in the YNACC corpus were used to create a predictive algorithm and train statistical models to automatically detect “good” conversations. We call these good conversations ERICs: Engaging, Respectful, and/or Informative Conversations, and they are characterized by:

  • A respectful exchange of ideas, opinions, and/or information in response to a given topic or topics.
  • Opinions expressed as an attempt to elicit a dialogue or persuade.
  • Comments that seek to contribute some new information or perspective on the relevant topic.

image

image

Example of an ERIC conversation (top) and a non-ERIC conversation (bottom)

ERICs have no single identifying attribute. A good conversation is determined by how many respectful, engaging, and persuading comments are present. For instance, an exchange where communicants are in total agreement throughout can be an ERIC, as can an exchange with a heated disagreement. Our algorithm ranks either of these types of exchanges higher than those that lack ERICs. Many of the labels for the ERICs in our dataset are the result of a new coding scheme (annotation taxonomy) we developed and are for characteristics of online conversations not captured by traditional argumentation or dialogue features. Some of the labels we collected have been annotated in previous work [3,4], and this is the first time they are aggregated in a single corpus at the dialogue level.

Additionally, we collected annotations on 1K threads from the Internet Argument Corpus, representing another domain of online debates. Our corpus and annotation scheme is the first exploration of how characteristics of individual comments contribute to the dialogue-level classification of an exchange. We hope YNACC will facilitate research to understand ERICs and other aspects of dialogue in general.

The technical contributions of this dataset are described in two scientific papers:

[1] Courtney Napoles, Aasish Pappu, and Joel Tetreault. “Automatically Identifying Good Conversations Online (Yes, they do exist!)”. In Proceedings of ICWSM'17.

[2] Courtney Napoles, Joel Tetreault, Aasish Pappu, Enrica Rosato and Brian Provenzale. 2017. “Finding Good Conversations Online: The Yahoo News Annotated Comments Corpus”. In Proceedings of The 11th Linguistic Annotation Workshop (LAW-XI).

References:

[3] Rob Abbott, Brian Ecker, Pranav Anand, and Marilyn Walker. Internet Argument Corpus 2.0: An SQL schema for dialogic social media and the corpora to go with it. LREC 2016.

[4] Marilyn Walker, Jean Fox Tree, Pranav Anand, Rob Abbott, and Joseph King. A corpus for research on deliberation and debate. LREC 2012.

Introducing Similarity Search at Flickr

By Clayton Mellina, Software Development Engineer

At Yahoo, our Computer Vision team works closely with Flickr, one of the world’s largest photo-sharing communities. The billions of photos hosted by Flickr allow us to tackle some of the most interesting real-world problems in image and video understanding. One of those major problems is that of discovery. We understand that the value in our photo corpus is only unlocked when the community can find photos and photographers that inspire them, so we strive to enable the discovery and appreciation of new photos.

To further that effort, today we are introducing similarity search on Flickr. If you hover over a photo on a search result page, you will reveal a “…” button that exposes a menu that gives you the option to search for photos similar to the photo you are currently viewing.

In many ways, photo search is very different from traditional web or text search. First, the goal of web search is usually to satisfy a particular information need, while with photo search the goal is often one of discovery; as such, it should be delightful as well as functional. We have taken this to heart throughout Flickr. For instance, our color search feature, which allows filtering by color scheme, and our style filters, which allow filtering by styles such as “minimalist” or “patterns,” encourage exploration. Second, in traditional web search, the goal is usually to match documents to a set of keywords in the query. That is, the query is in the same modality—text—as the documents being searched. Photo search usually matches across modalities: text to image. Text querying is a necessary feature of a photo search engine, but, as the saying goes, a picture is worth a thousand words. And beyond saving people the effort of so much typing, many visual concepts genuinely defy accurate description. Now, we’re giving our community a way to easily explore those visual concepts with the “…” button, a feature we call the similarity pivot.

image

The similarity pivot is a significant addition to the Flickr experience because it offers our community an entirely new way to explore and discover the billions of incredible photos and millions of incredible photographers on Flickr. It allows people to look for images of a particular style, it gives people a view into universal behaviors, and even when it “messes up,” it can force people to look at the unexpected commonalities and oddities of our visual world with a fresh perspective.

What is “similarity?”

To understand how an experience like this is powered, we first need to understand what we mean by “similarity.” There are many ways photos can be similar to one another. Consider some examples.

image
image
image

It is apparent that all of these groups of photos illustrate some notion of “similarity,” but each is different. Roughly, they are: similarity of color, similarity of texture, and similarity of semantic category. And there are many others that you might imagine as well.

What notion of similarity is best suited for a site like Flickr? Ideally, we’d like to be able to capture multiple types of similarity, but we decided early on that semantic similarity—similarity based on the semantic content of the photos—was vital to wholly facilitate discovery on Flickr. This requires a deep understanding of image content for which we employ deep neural networks.

We have been using deep neural networks at Flickr for a while for various tasks such as object recognition, NSFW prediction, and even prediction of aesthetic quality. For these tasks, we train a neural network to map the raw pixels of a photo into a set of relevant tags, as illustrated below.

image

Internally, the neural network accomplishes this mapping incrementally by applying a series of transformations to the image, which can be thought of as a vector of numbers corresponding to the pixel intensities. Each transformation in the series produces another vector, which is in turn the input to the next transformation, until finally we have a vector that we specifically constrain to be a list of probabilities for each class we are trying to recognize in the image. To be able to go from raw pixels to a semantic label like “hot air balloon,” the network discards lots of information about the image, including information about  appearance, such as the color of the balloon, its relative position in the sky, etc. Instead, we can extract an internal vector in the network before the final output.

image

For common neural network architectures, this vector—which we call a “feature vector”—has many hundreds or thousands of dimensions. We can’t necessarily say with certainty that any one of these dimensions means something in particular as we could at the final network output, whose dimensions correspond to tag probabilities. But these vectors have an important property: when you compute the Euclidean distance between these vectors, images containing similar content will tend to have feature vectors closer together than images containing dissimilar content. You can think of this as a way that the network has learned to organize information present in the image so that it can output the required class prediction. This is exactly what we are looking for: Euclidian distance in this high-dimensional feature space is a measure of semantic similarity. The graphic below illustrates this idea: points in the neighborhood around the query image are semantically similar to the query image, whereas points in neighborhoods further away are not.

image

This measure of similarity is not perfect and cannot capture all possible notions of similarity—it will be constrained by the particular task the network was trained to perform, i.e., scene recognition. However, it is effective for our purposes, and, importantly, it contains information beyond merely the semantic content of the image, such as appearance, composition, and texture. Most importantly, it gives us a simple algorithm for finding visually similar photos: compute the distance in the feature space of a query image to each index image and return the images with lowest distance. Of course, there is much more work to do to make this idea work for billions of images.

Large-scale approximate nearest neighbor search

With an index as large as Flickr’s, computing distances exhaustively for each query is intractable. Additionally, storing a high-dimensional floating point feature vector for each of billions of images takes a large amount of disk space and poses even more difficulty if these features need to be in memory for fast ranking. To solve these two issues, we adopt a state-of-the-art approximate nearest neighbor algorithm called Locally Optimized Product Quantization (LOPQ).

To understand LOPQ, it is useful to first look at a simple strategy. Rather than ranking all vectors in the index, we can first filter a set of good candidates and only do expensive distance computations on them. For example, we can use an algorithm like k-means to cluster our index vectors, find the cluster to which each vector is assigned, and index the corresponding cluster id for each vector. At query time, we find the cluster that the query vector is assigned to and fetch the items that belong to the same cluster from the index. We can even expand this set if we like by fetching items from the next nearest cluster.

This idea will take us far, but not far enough for a billions-scale index. For example, with 1 billion photos, we need 1 million clusters so that each cluster contains an average of 1000 photos. At query time, we will have to compute the distance from the query to each of these 1 million cluster centroids in order to find the nearest clusters. This is quite a lot. We can do better, however, if we instead split our vectors in half by dimension and cluster each half separately. In this scheme, each vector will be assigned to a pair of cluster ids, one for each half of the vector. If we choose k = 1000 to cluster both halves, we have k2 = 1000 * 1000 = 1e6 possible pairs. In other words, by clustering each half separately and assigning each item a pair of cluster ids, we can get the same granularity of partitioning (1 million clusters total) with only 2*1000 distance computations with half the number of dimensions for a total computational savings of 1000x. Conversely, for the same computational cost, we gain a factor of k more partitions of the data space, providing a much finer-grained index.

This idea of splitting vectors into subvectors and clustering each split separately is called product quantization. When we use this idea to index a dataset it is called the inverted multi-index, and it forms the basis for fast candidate retrieval in our similarity index. Typically the distribution of points over the clusters in a multi-index will be unbalanced as compared to a standard k-means index, but this unbalance is a fair trade for the much higher resolution partitioning that it buys us. In fact, a multi-index will only be balanced across clusters if the two halves of the vectors are perfectly statistically independent. This is not the case in most real world data, but some heuristic preprocessing—like PCA-ing and permuting the dimensions so that the cumulative per-dimension variance is approximately balanced between the halves—helps in many cases. And just like the simple k-means index, there is a fast algorithm for finding a ranked list of clusters to a query if we need to expand the candidate set.

After we have a set of candidates, we must rank them. We could store the full vector in the index and use it to compute the distance for each candidate item, but this would incur a large memory overhead (for example, 256 dimensional vectors of 4 byte floats would require 1Tb for 1 billion photos) as well as a computational overhead. LOPQ solves these issues by performing another product quantization, this time on the residuals of the data. The residual of a point is the difference vector between the point and its closest cluster centroid. Given a residual vector and the cluster indexes along with the corresponding centroids, we have enough information to reproduce the original vector exactly. Instead of storing the residuals, LOPQ product quantizes the residuals, usually with a higher number of splits, and stores only the cluster indexes in the index. For example, if we split the vector into 8 splits and each split is clustered with 256 centroids, we can store the compressed vector with only 8 bytes regardless of the number of dimensions to start (though certainly a higher number of dimensions will result in higher approximation error). With this lossy representation we can produce a reconstruction of a vector from the 8 byte codes: we simply take each quantization code, look up the corresponding centroid, and concatenate these 8 centroids together to produce a reconstruction. Likewise, we can approximate the distance from the query to an index vector by computing the distance between the query and the reconstruction. We can do this computation quickly for many candidate points by computing the squared difference of each split of the query to all of the centroids for that split. After computing this table, we can compute the squared difference for an index point by looking up the precomputed squared difference for each of the 8 indexes and summing them together to get the total squared difference. This caching trick allows us to quickly rank many candidates without resorting to distance computations in the original vector space.

LOPQ adds one final detail: for each cluster in the multi-index, LOPQ fits a local rotation to the residuals of the points that fall in that cluster. This rotation is simply a PCA that aligns the major directions of variation in the data to the axes followed by a permutation to heuristically balance the variance across the splits of the product quantization. Note that this is the exact preprocessing step that is usually performed at the top-level multi-index. It tends to make the approximate distance computations more accurate by mitigating errors introduced by assuming that each split of the vector in the production quantization is statistically independent from other splits. Additionally, since a rotation is fit for each cluster, they serve to fit the local data distribution better.

Below is a diagram from the LOPQ paper that illustrates the core ideas of LOPQ. K-means (a) is very effective at allocating cluster centroids, illustrated as red points, that target the distribution of the data, but it has other drawbacks at scale as discussed earlier. In the 2d example shown, we can imagine product quantizing the space with 2 splits, each with 1 dimension. Product Quantization (b) clusters each dimension independently and cluster centroids are specified by pairs of cluster indexes, one for each split. This is effectively a grid over the space. Since the splits are treated as if they were statistically independent, we will, unfortunately, get many clusters that are “wasted” by not targeting the data distribution. We can improve on this situation by rotating the data such that the main dimensions of variation are axis-aligned. This version, called Optimized Product Quantization ©, does a better job of making sure each centroid is useful. LOPQ (d) extends this idea by first coarsely clustering the data and then doing a separate instance of OPQ for each cluster, allowing highly targeted centroids while still reaping the benefits of product quantization in terms of scalability.

image

LOPQ is state-of-the-art for quantization methods, and you can find more information about the algorithm, as well as benchmarks, here. Additionally, we provide an open-source implementation in Python and Spark which you can apply to your own datasets. The algorithm produces a set of cluster indexes that can be queried efficiently in an inverted index, as described. We have also explored use cases that use these indexes as a hash for fast deduplication of images and large-scale clustering. These extended use cases are studied here.

Conclusion

We have described our system for large-scale visual similarity search at Flickr. Techniques for producing high-quality vector representations for images with deep learning are constantly improving, enabling new ways to search and explore large multimedia collections. These techniques are being applied in other domains as well to, for example, produce vector representations for text, video, and even molecules. Large-scale approximate nearest neighbor search has importance and potential application in these domains as well as many others. Though these techniques are in their infancy, we hope similarity search provides a useful new way to appreciate the amazing collection of images at Flickr and surface photos of interest that may have previously gone undiscovered. We are excited about the future of this technology at Flickr and beyond.

Acknowledgements

Yannis Kalantidis, Huy Nguyen, Stacey Svetlichnaya, Arel Cordero. Special thanks to the rest of the Computer Vision and Machine Learning team and the Vespa search team who manages Yahoo’s internal search engine.

Researching the Future of Automated Question-Answering

By Dan Pelleg, Yuval Pinter, and David Carmel

There’s no denying it: chat bots are en vogue and it seems like everyone is experimenting with the technology. Facebook introduced bots for their Messenger Platform, Microsoft launched a controversial chatbot on Twitter called Tay, and at Yahoo, we’ve released our own bots on Kik and Messenger. With so much interest, we’re doing our part as research scientists to advance the state-of-the-art in question-answering, which, among other things, will lead to conversational bots appearing more human.

For the second year in a row, the Yahoo Research Text Mining team in Haifa, in collaboration with Emory University and the U.S. National Institute of Standards and Technology, ran a shared challenge for question-answering called LiveQA as part of the annual Text REtrieval Conference (TREC).

During the 24-hour duration of the competition, 14 teams from the USA, China, Germany, Australia, Canada, Israel, and Qatar participated, though preparations for the competition actually started a few months earlier. Each team developed a Web service that, as input, received a free-form question, and then responded with an answer. The answer could not be longer than 1000 characters, and needed to be returned within one minute. The questions used, from Yahoo Answers, were different than the factoid questions used in some previous TREC QA tracks. Yahoo Answers questions are much more diverse and cover multiple question types, such as opinion, advice, and polls. This made the task far more realistic and challenging.

image

During the competition day, the TREC participants received a total of 1,088 questions and sent back a total of over 21,000 answers. They usually had no issue with either the time limit (24 seconds to answer, on average) or the length limit (599 characters per average answer).

The answers were judged manually on a four-point scale (excellent/good/fair/poor). The best system in terms of performance was the “EmoryCrowd” system from Emory. This was a unique system, because it used a combination of computational and human labor. First, it computed candidate answers algorithmically, and then it turned to crowdsourced work to rank them – all in less than 60 seconds. This “cyborg” system provided the highest-quality answers by a noticeable margin, trailed by fully-algorithmic systems from CMU, Emory, and Yahoo Research.

Human intellect was shown to be superior not just within hybrid systems, but also by itself, when we had the judges also evaluate the answers given by users on the original Yahoo Answers site. These answers were significantly better than any automated or hybrid system. Of importance though, the Yahoo Answers users had no time or space limit – if they wanted to, they could work on their answers for up to one week, which would give them an obvious advantage. This is just one reason to keep pushing the envelope on automatic question answering. Computers are faster than people for many tasks, they are available 24/7 (if the power is on), and they scale more readily.

Compared to last year’s challenge, the quality of results (i.e. the answers) improved significantly. However, the quality was still below that of the human responders’. We also introduced a new task for question summarization, and this task is far from being solved. Our plan is to run the LiveQA challenge next year, thus allowing the participants to further improve and extend their systems. We hope that additional teams will join this joint research effort of answering real users’ questions in real-time, with the goal of encouraging progress in the field of Natural Language Processing as it relates to question-answering. Who knows, in the future, maybe our learnings will find their way to a bot near you!