<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><description></description><title>Yahoo Research</title><generator>Tumblr (3.0; @yahooresearch)</generator><link>https://yahooresearch.tumblr.com/</link><item><title>Congratulations 2019 Faculty and Research Engagement Program (FREP) Recipients!</title><description>&lt;p&gt;&lt;a href="https://research.yahoo.com/" target="_blank"&gt;Yahoo Research&lt;/a&gt; is excited to announce the 2019 Faculty and Research Engagement Program (FREP) recipients. This year, we received 100+ proposals from a variety of prestigious institutions around the world. The competition was intense, the review process was difficult, and making the final decisions wasn’t easy. The grants will support professors and students who explore a diverse set of fields, including machine learning, distributed systems, online security, content understanding and recommendation, and image and video understanding.&lt;/p&gt;&lt;p&gt;FREP awards grants to faculty members in support of research to enhance people&amp;rsquo;s lives by improving the internet. FREP was founded in 2012 to foster cutting-edge collaborations between scientists in academic settings and those at Yahoo Research. 
We look forward to the insights, scientific advances, and relationships that will grow from FREP over the coming year and for many years to come!&lt;/p&gt;&lt;p&gt;Congratulations to these very impressive researchers:&lt;br/&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Acceleration for Data Science and Machine Learning&lt;/b&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="https://www.researchgate.net/profile/Alexander_Gasnikov" target="_blank"&gt;Alexander Gasnikov (co-PI)&lt;/a&gt; &amp;amp; &lt;a href="https://cauribe.mit.edu/" target="_blank"&gt;Cesar Uribe (co-PI)&lt;/a&gt;, Moscow Institute of Physics and Technology (State University) &amp;amp; MIT&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Scalable Online Detection of Complex Patterns in Rapid Event Streams&lt;/b&gt;&lt;br/&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="https://il.linkedin.com/in/assaf-schuster-6abb56149?trk=people-guest_profile-result-card_result-card_full-click" target="_blank"&gt;Assaf Schuster&lt;/a&gt;, Technion&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Optimal-Transport Bayesian Sampling with Applications to Repulsive Attentions in NLP&lt;/b&gt;&lt;br/&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="https://engineering.buffalo.edu/computer-science-engineering/people/faculty-directory/changyou-chen.html" target="_blank"&gt;Changyou Chen&lt;/a&gt;, State University of New York at Buffalo&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Interactive learning from weak annotations&lt;/b&gt;&lt;br/&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.cs.columbia.edu/~djhsu/" target="_blank"&gt;Daniel Hsu&lt;/a&gt;, Columbia University&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Deep Learning for Analyzing Ultrasound Movie Images&lt;/b&gt;&lt;br/&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="https://profiles.stanford.edu/daniel-rubin" target="_blank"&gt;Daniel Rubin&lt;/a&gt;, Stanford University School of 
Medicine&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Detecting Intrinsic Visual Privacy Threats&lt;/b&gt;&lt;br/&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="https://www3.cs.stonybrook.edu/~hling/" target="_blank"&gt;Haibin Ling&lt;/a&gt;, Stony Brook University (SUNY)&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;PASTE: PArallel Synthesis, Training and Enhancement via Distributionally Robust Optimization and Optimal Transport&lt;/b&gt;&lt;br/&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="https://web.stanford.edu/~jblanche/" target="_blank"&gt;Jose Blanchet&lt;/a&gt;, Stanford University&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Representation Learning for Product Graphs&lt;/b&gt;&lt;br/&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="https://usc-isi-i2.github.io/kejriwal/" target="_blank"&gt;Mayank Kejriwal&lt;/a&gt;, University of Southern California&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Large-scale multi-objective sequential decision making&lt;/b&gt;&lt;br/&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="https://weng.fr/" target="_blank"&gt;Paul Weng&lt;/a&gt; (co-PI) &amp;amp; &lt;a href="http://www.cs.put.poznan.pl/wkotlowski/" target="_blank"&gt;Wojciech Kotlowski&lt;/a&gt; (co-PI), Shanghai Jiao Tong University &amp;amp; Poznan University of Technology&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Large-Scale Graph Embeddings&lt;/b&gt;&lt;br/&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="https://www3.cs.stonybrook.edu/~skiena/" target="_blank"&gt;Steven Skiena&lt;/a&gt;, Stony Brook University&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Communication-Efficient Federated Learning&lt;/b&gt;&lt;br/&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="https://web.stanford.edu/~tsachy/" target="_blank"&gt;Tsachy Weissman&lt;/a&gt;, Stanford&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Adversarial Reformulation-Aware Query Suggestion with Graph Convolutional Networks&lt;/b&gt;&lt;br/&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a 
href="http://web.cs.ucla.edu/~weiwang/" target="_blank"&gt;Wei Wang&lt;/a&gt;, UCLA&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;&lt;b&gt;Modeling Temporal Dynamics of User Behavior for Improved Advertising&lt;/b&gt;&lt;br/&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.dabi.temple.edu/~zoran/" target="_blank"&gt;Zoran Obradovic&lt;/a&gt;, Temple University&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;</description><link>https://yahooresearch.tumblr.com/post/187608558221</link><guid>https://yahooresearch.tumblr.com/post/187608558221</guid><pubDate>Mon, 09 Sep 2019 15:53:41 -0700</pubDate><category>machine learning</category><category>data science</category><category>big data</category><category>deep learning</category></item><item><title>Yahoo Research Wins Runner-Up Best Paper Award for “Time-Aware Prospective Modeling of Users for Online Display Advertising” at AdKDD</title><description>&lt;p&gt;By &lt;a href="https://www.linkedin.com/in/kim-capps-tanaka-6b7a39/" target="_blank"&gt;Kim Capps-Tanaka&lt;/a&gt;, Chief of Staff, Yahoo Research&lt;/p&gt;&lt;p&gt;&lt;a href="https://www.kdd.org/kdd2019/" target="_blank"&gt;KDD 2019&lt;/a&gt; in Anchorage, Alaska, has been fantastic so far and yesterday was especially exciting as we won AdKDD’s Runner-Up Best Paper Award for “Time-Aware Prospective Modeling of Users for Online Display Advertising”.&lt;/p&gt;&lt;p&gt;Congratulations to &lt;a href="https://www.linkedin.com/in/gligorijevic" target="_blank"&gt;Djordje Gligorijevic&lt;/a&gt; (Research Scientist), &lt;a href="https://www.linkedin.com/in/jelenastojanovic" target="_blank"&gt;Jelena Gligorijevic&lt;/a&gt; (Research Scientist) and &lt;a href="https://www.linkedin.com/in/aaron-flores-12b405/" target="_blank"&gt;Aaron Flores&lt;/a&gt; (Sr. Director)! 
&lt;br/&gt;&lt;/p&gt;&lt;blockquote class="twitter-tweet" data-lang="en"&gt;&lt;p lang="en" dir="ltr"&gt;Congrats Djordje Gligorijevic, Jelena Gligorijevic, and Aaron Flores for receiving the &lt;a href="https://twitter.com/hashtag/AdKDD?src=hash&amp;amp;ref_src=twsrc%5Etfw" target="_blank"&gt;#AdKDD&lt;/a&gt; Runner-Up Best Paper Award for “Time-Aware Prospective Modeling of Users for Online Display Advertising” at &lt;a href="https://twitter.com/hashtag/KDD19?src=hash&amp;amp;ref_src=twsrc%5Etfw" target="_blank"&gt;#KDD19&lt;/a&gt;! &lt;a href="https://twitter.com/hashtag/datascience?src=hash&amp;amp;ref_src=twsrc%5Etfw" target="_blank"&gt;#datascience&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/machinelearning?src=hash&amp;amp;ref_src=twsrc%5Etfw" target="_blank"&gt;#machinelearning&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/KDD2019?src=hash&amp;amp;ref_src=twsrc%5Etfw" target="_blank"&gt;#KDD2019&lt;/a&gt; &lt;a href="https://twitter.com/kdd_news?ref_src=twsrc%5Etfw" target="_blank"&gt;@kdd_news&lt;/a&gt; &lt;a href="https://t.co/3cTR2svB6H" target="_blank"&gt;pic.twitter.com/3cTR2svB6H&lt;/a&gt;&lt;/p&gt;&lt;div&gt;— Yahoo Research (@YahooResearch) &lt;/div&gt;&lt;a href="https://twitter.com/YahooResearch/status/1158501210350948352?ref_src=twsrc%5Etfw" target="_blank"&gt;August 5, 2019&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"&gt;&lt;/script&gt;&lt;p&gt;If you’re at KDD, we’d love to chat with you! Stop by booth #39 or any of the poster sessions below:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;“Predicting Different Type of Conversions using Multi-Task Learning”, &lt;b&gt;Junwei Pan&lt;/b&gt;, &lt;b&gt;Yizhi Mao&lt;/b&gt;, Alfonso Ruiz, Yu Sun, &lt;b&gt;Aaron Flores&lt;/b&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Tues, 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;“Carousel Ads Optimization in Yahoo Gemini Native”, &lt;b&gt;Oren Somekh&lt;/b&gt;, Michal Aharon, Avi Shahar, Assaf Singer, Boris Trayvas, Hadas Vogel, Dobri Dobrev&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Tues, 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;“Understanding Consumer Journey using Attention-based Recurrent Neural Networks”, Yichao Zhou, &lt;b&gt;Shaunak Mishra&lt;/b&gt;, &lt;b&gt;Jelena Gligorijevic&lt;/b&gt;, &lt;b&gt;Tarun Bhatia&lt;/b&gt;, &lt;b&gt;Narayan Bhamidipati&lt;/b&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Tues, 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;“Recurrent Neural Networks for Stochastic Control in Real-Time Bidding”, &lt;b&gt;Nicolas Grislain&lt;/b&gt;, &lt;b&gt;Nicolas Perrin&lt;/b&gt;, &lt;b&gt;Antoine Thabault&lt;/b&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Tues, 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;p&gt;* Bold denotes Yahoo Researchers&lt;/p&gt;&lt;p&gt;Thanks!&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/186824574596</link><guid>https://yahooresearch.tumblr.com/post/186824574596</guid><pubDate>Tue, 06 Aug 2019 15:09:12 -0700</pubDate><category>kdd</category><category>kdd2019</category><category>data science</category><category>machine learning</category></item><item><title>Meet Yahoo Research at KDD 2019</title><description>&lt;p&gt;By &lt;a 
href="https://www.linkedin.com/in/kim-capps-tanaka-6b7a39/" target="_blank"&gt;Kim Capps-Tanaka&lt;/a&gt;, Chief of Staff, Yahoo Research&lt;/p&gt;&lt;p&gt;If you’re attending &lt;a href="https://www.kdd.org/kdd2019/" target="_blank"&gt;KDD&lt;/a&gt; in Anchorage, Alaska, the Yahoo Research team would love to meet you! Send us an &lt;a href="mailto:yahookdd@verizonmedia.com" target="_blank"&gt;email&lt;/a&gt; or &lt;a href="https://twitter.com/YahooResearch" target="_blank"&gt;tweet&lt;/a&gt; to discuss research or job opportunities on the team.&lt;/p&gt;&lt;p&gt;In addition to hosting a booth, we’re excited to present papers, posters, and talks. &lt;/p&gt;&lt;p&gt;&lt;b&gt;Sunday, August 4th&lt;/b&gt;&lt;/p&gt;&lt;p&gt;“Modeling and Applications for Temporal Point Processes”, Junchi Yan, Hongteng Xu, &lt;b&gt;Liangda Li&lt;/b&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt; 8am - 12pm, Summit 8-Ground Level, Egan&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;b&gt;Monday, August 5th&lt;/b&gt;&lt;/p&gt;&lt;p&gt;“Time-Aware Prospective Modeling of Users for Online Display Advertising”, &lt;b&gt;Djordje Gligorijevic&lt;/b&gt;, &lt;b&gt;Jelena Gligorijevic&lt;/b&gt;, &lt;b&gt;Aaron Flores&lt;/b&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;8:40am - 9am, Kahtnu 2 - Level 2, Dena’ina&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;“The Future of Ads”, &lt;b&gt;Brendan Kitts&lt;/b&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;3pm-3:30pm, Kahtnu 2 - Level 2, Dena’ina&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;“Learning from Multi-User Activity Trails for B2B Ad Targeting”, &lt;b&gt;Shaunak Mishra&lt;/b&gt;, &lt;b&gt;Jelena Gligorijevic&lt;/b&gt;, &lt;b&gt;Narayan Bhamidipati&lt;/b&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;4:35pm-4:55pm, Kahtnu 2- Level 2, Dena’ina&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;“Automatic Feature Engineering From Very High Dimensional Event Logs Using Deep Neural Networks”, &lt;b&gt;Kai Hu&lt;/b&gt;, &lt;b&gt;Joey Wang&lt;/b&gt;, &lt;b&gt;Yong Liu&lt;/b&gt;, &lt;b&gt;Datong Chen&lt;/b&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;7pm-9:30pm, Section 3 of 
Idlughet (Eklutna) Exhibit Hall&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;b&gt;Tuesday, August 6th&lt;/b&gt;&lt;/p&gt;&lt;p&gt;“Predicting Different Type of Conversions using Multi-Task Learning”, &lt;b&gt;Junwei Pan&lt;/b&gt;, &lt;b&gt;Yizhi Mao&lt;/b&gt;, Alfonso Ruiz, Yu Sun, &lt;b&gt;Aaron Flores&lt;/b&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;“Carousel Ads Optimization in Yahoo Gemini Native”, &lt;b&gt;Oren Somekh&lt;/b&gt;, Michal Aharon, Avi Shahar, Assaf Singer, Boris Trayvas, Hadas Vogel, Dobri Dobrev&lt;/p&gt;&lt;ul&gt;&lt;li&gt;7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;“Understanding Consumer Journey using Attention-based Recurrent Neural Networks”, Yichao Zhou, &lt;b&gt;Shaunak Mishra&lt;/b&gt;, &lt;b&gt;Jelena Gligorijevic&lt;/b&gt;, &lt;b&gt;Tarun Bhatia&lt;/b&gt;, &lt;b&gt;Narayan Bhamidipati&lt;/b&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;“Recurrent Neural Networks for Stochastic Control in Real-Time Bidding”, &lt;b&gt;Nicolas Grislain&lt;/b&gt;, &lt;b&gt;Nicolas Perrin&lt;/b&gt;, &lt;b&gt;Antoine Thabault&lt;/b&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;* Bold denotes Yahoo Researchers&lt;/p&gt;&lt;p&gt;Hope to see you at KDD!&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/186797645996</link><guid>https://yahooresearch.tumblr.com/post/186797645996</guid><pubDate>Mon, 05 Aug 2019 12:26:07 -0700</pubDate><category>machine learning</category><category>data science</category><category>deep learning</category></item><item><title>Introducing the Yahoo News Ranked Multi-label Corpus, a Novel Dataset to Improve Multilabel Learning</title><description>&lt;p&gt;&lt;br/&gt;By Akshay Soni, Aasish Pappu&lt;/p&gt;&lt;p&gt;Most content-based websites, like Yahoo News, 
HuffPost, or any given news site, organize their stories according to subject matter or in some similar way. Websites with a huge number of stories need an automated method to filter or categorize them as the content is ingested into their systems. For example, algorithms that power Yahoo News label news articles with tags (e.g., &lt;i&gt;Military conflict, Nuclear policy, Refugees&lt;/i&gt;) as they are ingested, and then display the content by subject matter and/or on a personalized feed. This process of labeling content with all its relevant tags is known as Multilabel Learning (MLL).&lt;br/&gt;&lt;/p&gt;&lt;figure class="tmblr-full" data-orig-height="164" data-orig-width="622"&gt;&lt;img src="https://66.media.tumblr.com/faf849077e6724a6fcd5c6cf470c4fdf/tumblr_inline_ovd4x8Pgmq1rgj0aw_540.png" data-orig-height="164" data-orig-width="622" alt="image"/&gt;&lt;/figure&gt;&lt;center&gt;&lt;address dir="ltr"&gt;An overview of an MLL system in action: as the news articles are ingested, MLL tags them with all the relevant labels.&lt;/address&gt;&lt;/center&gt;&lt;br/&gt;&lt;p&gt;Until now, scientists and engineers who used MLL to build their own models for labeling content have relied on datasets with pre-computed features, such as bag-of-words, or dense representations, such as doc2vec. 
In the process of writing our recent &lt;a href="http://acl2017.org/" target="_blank"&gt;ACL 2017&lt;/a&gt; publication &amp;ldquo;&lt;a href="https://research.yahoo.com/publications/8893/doctag2vec-embedding-based-multi-label-learning-approach-document-tagging" target="_blank"&gt;DocTag2Vec: An embedding based Multilabel Learning approach for Document Tagging&lt;/a&gt;,&amp;rdquo; presented in the &lt;a href="https://sites.google.com/site/repl4nlp2017/home" target="_blank"&gt;RepL4NLP&lt;/a&gt; workshop, we compiled a novel dataset with &lt;i&gt;raw news stories&lt;/i&gt; that allows researchers to construct &lt;i&gt;their own&lt;/i&gt; features that are best suited to their MLL algorithms. As part of our &lt;a href="https://webscope.sandbox.yahoo.com/" target="_blank"&gt;Webscope data-sharing program&lt;/a&gt;, we are making our dataset, called &lt;a href="https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&amp;amp;did=84" target="_blank"&gt;Yahoo News Ranked Multi-label Corpus&lt;/a&gt; (YNMLC), available to academics to further advance MLL research.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;While traditional MLL approaches rely on given features, DocTag2Vec operates on raw text and automatically learns the best features of that text by embedding both the documents and the tags in the same vector space. Inference is then done via a simple nearest-neighbor approach. DocTag2Vec relies on training data that is composed of the raw text of every document and the labels associated with it.&lt;/p&gt;&lt;figure class="tmblr-full" data-orig-height="124" data-orig-width="369"&gt;&lt;img src="https://66.media.tumblr.com/dd216c0f9ddd2ac99afd8279a2df7825/tumblr_inline_ovd4xpbZ6O1rgj0aw_540.png" data-orig-height="124" data-orig-width="369" alt="image"/&gt;&lt;/figure&gt;&lt;center&gt;&lt;address dir="ltr"&gt;DocTag2Vec embeds documents and the labels associated with them in a common vector-space. 
This allows inference by a nearest-neighbor approach.&lt;/address&gt;&lt;/center&gt;&lt;br/&gt;&lt;p&gt;There are many standard datasets available for MLL, but all of them directly provide features and not the actual text of the documents. This lets researchers work on new algorithms that use the provided features, but not improve the features themselves. &lt;b&gt;YNMLC provides raw text so that researchers can extract the features that are best for their algorithms&lt;/b&gt;. Apart from that, to the best of our knowledge, &lt;b&gt;our corpus is the only one that ranks the labels of each document by importance&lt;/b&gt;. YNMLC is one of the few large-scale datasets for MLL that were manually labeled by experts (Yahoo News editors). The corpus contains 48,968 articles, each tagged with a subset of 413 labels. These tags correspond to &lt;a href="https://techcrunch.com/2016/10/04/yahoo-rebrands-its-main-app-as-yahoo-newsroom-lets-you-post-your-own-news-links/" target="_blank"&gt;Vibes&lt;/a&gt; (akin to topics) in the Yahoo Newsroom app. &lt;/p&gt;&lt;p&gt;MLL is an area of research that we have applied to labeling news stories. MLL can also be used to label music, videos, blog posts, and virtually any other type of online content. We look forward to seeing the innovative ways in which YNMLC will be used to develop new approaches.&lt;/p&gt;&lt;p&gt;&lt;i&gt;Please cite the following paper if you are using this dataset for academic purposes:&lt;/i&gt;&lt;/p&gt;&lt;p&gt;&amp;ldquo;DocTag2Vec: An embedding based Multilabel Learning approach for Document Tagging&amp;rdquo;, The 2nd Workshop on Representation Learning for NLP, 2017. 
Sheng Chen, Akshay Soni, Aasish Pappu, and Yashar Mehdad.&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/164789421126</link><guid>https://yahooresearch.tumblr.com/post/164789421126</guid><pubDate>Wed, 30 Aug 2017 08:03:24 -0700</pubDate><category>Oath</category><category>Yahoo</category><category>Yahoo News</category><category>Yahoo Research</category><category>Research</category><category>Machine Learning</category><category>Datasets</category><category>Data</category></item><item><title>HBase Goes Fast and Lean with the Accordion Algorithm</title><description>&lt;p&gt;By Edward Bortnikov, Anastasia Braginsky, and Eshcar Hillel&lt;/p&gt;&lt;figure class="tmblr-full" data-orig-height="500" data-orig-width="500"&gt;&lt;img src="https://66.media.tumblr.com/08dfd7c77412ad46d26a56346b08fb77/tumblr_inline_orfajleJh01rgj0aw_540.gif" data-orig-height="500" data-orig-width="500"/&gt;&lt;/figure&gt;&lt;p&gt;Modern products powered by NoSQL key-value (KV-)storage technologies exhibit ever-increasing performance expectations. Ideally, NoSQL applications would like to enjoy the speed of in-memory databases without giving up on reliable persistent storage guarantees. Our Scalable Systems research team has implemented a new algorithm, named Accordion, that takes a significant step toward this goal, and has contributed it to the forthcoming release of Apache HBase 2.0.&lt;/p&gt;&lt;p&gt;&lt;a href="https://hbase.apache.org/" target="_blank"&gt;HBase&lt;/a&gt;, a distributed KV-store for Hadoop, is used by many companies every day to scale products seamlessly with huge volumes of data and deliver real-time performance. At Yahoo, HBase powers a variety of products, including Yahoo Mail, Yahoo Search, Flurry Analytics, and more. Accordion is a complete re-write of core parts of the HBase server technology, named RegionServer. It improves the server scalability via a better use of RAM. Namely, it accommodates more data in memory and writes to disk less frequently. 
This manifests in a number of desirable phenomena. First, HBase’s disk occupancy and write amplification are reduced. Second, more reads and writes get served from RAM, and fewer are stalled by disk I/O. Traditionally, these different metrics were considered at odds, and tuned at each other’s expense. With Accordion, they all get improved simultaneously.&lt;/p&gt;&lt;p&gt;We stress-tested Accordion-enabled HBase under a variety of workloads. Our experiments exercised different blends of reads and writes, as well as different key distributions (heavy-tailed versus uniform). We witnessed performance improvements across the board. Namely, we saw write throughput increases of 20% to 40% (depending on the workload), tail read latency reductions of up to 10%, disk write reductions of up to 30%, and also some modest Java garbage collection overhead reduction. The figures below further zoom into Accordion’s performance gains, compared to the legacy algorithm.&lt;/p&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/default/691231/Accordion1.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/default/691231/Accordion1.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="276" data-orig-width="528" data-orig-src="https://s.yimg.com/ge/default/691231/Accordion1.png"&gt;&lt;img src="https://66.media.tumblr.com/e965d97feebea0c7f99e3e049e083717/tumblr_inline_p7k6h4KzPn1rgj0aw_540.png" alt="image" data-orig-height="276" data-orig-width="528" data-orig-src="https://s.yimg.com/ge/default/691231/Accordion1.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;center&gt;&lt;address dir="ltr"&gt;Figure 1. Accordion’s write throughput compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. 
Zipf (heavy-tailed) and Uniform primary key distributions.&lt;/address&gt;&lt;/center&gt;&lt;br/&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/default/691231/Accordion2.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/default/691231/Accordion2.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="238" data-orig-width="388" data-orig-src="https://s.yimg.com/ge/default/691231/Accordion2.png"&gt;&lt;img src="https://66.media.tumblr.com/334e9016e944e7a1b1e5c7722dc042e1/tumblr_inline_p7k6h43tqh1rgj0aw_540.png" alt="image" data-orig-height="238" data-orig-width="388" data-orig-src="https://s.yimg.com/ge/default/691231/Accordion2.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;center&gt;&lt;address dir="ltr"&gt;Figure 2. Accordion’s read latency quantiles compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. Zipf key distribution.&lt;/address&gt;&lt;/center&gt;&lt;br/&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/default/691231/Accordion3.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/default/691231/Accordion3.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="217" data-orig-width="447" data-orig-src="https://s.yimg.com/ge/default/691231/Accordion3.png"&gt;&lt;img src="https://66.media.tumblr.com/79517f4b4f6d30493366569b87144272/tumblr_inline_p7k6h4RPpv1rgj0aw_540.png" alt="image" data-orig-height="217" data-orig-width="447" data-orig-src="https://s.yimg.com/ge/default/691231/Accordion3.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;center&gt;&lt;address dir="ltr"&gt;Figure 3. Accordion’s disk I/O compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. 
Zipf key distribution.&lt;/address&gt;&lt;/center&gt;&lt;br/&gt;&lt;p&gt;Accordion is inspired by the&lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree" target="_blank"&gt; Log-Structured-Merge (LSM) tree&lt;/a&gt; design pattern that governs HBase storage organization. An HBase region is stored as a sequence of searchable key-value maps. The topmost is a mutable in-memory store, called MemStore, which absorbs the recent write (put) operations. The rest are immutable HDFS files, called HFiles. Once a MemStore overflows, it is flushed to disk, creating a new HFile. HBase adopts&lt;a href="https://blogs.apache.org/hbase/entry/apache_hbase_internals_locking_and" target="_blank"&gt; multi-versioned concurrency control&lt;/a&gt; – that is, MemStore stores all data modifications as separate versions. Multiple versions of one key may therefore reside in MemStore and the HFile tier. A read (get) operation, which retrieves the value by key, scans the HFile data in BlockCache, seeking the latest version. To reduce the number of disk accesses, HFiles are merged in the background. This process, called &lt;i&gt;compaction&lt;/i&gt;, removes the redundant cells and creates larger files.&lt;/p&gt;&lt;p&gt;LSM trees deliver superior write performance by transforming random application-level I/O to sequential disk I/O. However, their traditional design makes no attempt to compact the in-memory data. This stems from historical reasons: LSM trees were designed in the age when RAM was in very short supply, and therefore the MemStore capacity was small. With recent changes in the hardware landscape, the overall MemStore size managed by RegionServer can be multiple gigabytes, leaving a lot of headroom for optimization. &lt;/p&gt;&lt;p&gt;Accordion reapplies the LSM principle to MemStore in order to eliminate redundancies and other overhead while the data is still in RAM. 
The MemStore memory image is therefore “breathing” (periodically expanding and contracting), similarly to how an accordion bellows. This work pattern decreases the frequency of flushes to HDFS, thereby reducing the write amplification and the overall disk footprint. &lt;/p&gt;&lt;p&gt;With fewer flushes, the write operations are stalled less frequently as the MemStore overflows, and as a result, the write performance is improved. Less data on disk also implies less pressure on the block cache, higher hit rates, and eventually better read response times. Finally, having fewer disk writes also means having less compaction happening in the background, i.e., fewer cycles are stolen from productive (read and write) work. All in all, the effect of in-memory compaction can be thought of as a catalyst that enables the system to move faster as a whole. &lt;/p&gt;&lt;p&gt;Accordion currently provides two levels of in-memory compaction: &lt;b&gt;basic&lt;/b&gt; and &lt;b&gt;eager&lt;/b&gt;. The former applies generic optimizations that are good for all data update patterns. The latter is most useful for applications with high data churn, like producer-consumer queues, shopping carts, shared counters, etc. All these use cases feature frequent updates of the same keys, which generate multiple redundant versions that the algorithm takes advantage of to provide more value. Future implementations may tune the optimal compaction policy automatically. &lt;/p&gt;&lt;p&gt;Accordion replaces the default MemStore implementation in the production HBase code. Contributing its code to production HBase could not have happened without intensive work with the open source Hadoop community, with contributors stretched across companies, countries, and continents. The project took almost two years to complete, from inception to delivery. &lt;/p&gt;&lt;p&gt;Accordion will become generally available in the upcoming HBase 2.0 release. 
We can’t wait to see it power existing and future products at Yahoo and elsewhere.&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/161742308886</link><guid>https://yahooresearch.tumblr.com/post/161742308886</guid><pubDate>Mon, 12 Jun 2017 10:58:33 -0700</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>Yahoo Hadoop</category><category>HBase</category><category>Accordion</category><category>nosql</category></item><item><title>Researching the Definition of Good Online Conversations and How They Should Rank with the Yahoo News Annotated Comments Corpus</title><description>&lt;p&gt;By Courtney Napoles, &lt;a href="https://research.yahoo.com/researchers/aasishkp" target="_blank"&gt;Aasish Pappu&lt;/a&gt;, Joel Tetreault&lt;b&gt;&lt;br/&gt;&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Comment threads following online news articles often range from vacuous to hateful. That said, good conversations &lt;i&gt;do&lt;/i&gt; occur online, with people expressing different viewpoints and attempting to inform, convince, or better understand the other side, even if they can get lost among the multitude of unconstructive comments. At Yahoo Research, we show in recent statistical experiments that automatically identifying and &lt;i&gt;ranking good&lt;/i&gt; conversations on top will cultivate a more civil and constructive atmosphere in online communities and potentially encourage participation from more users [1]. &lt;/p&gt;&lt;p&gt;In an effort to foster more respectful online discussions and encourage more research among academics surrounding comments, we present the &lt;a href="http://webscope.sandbox.yahoo.com/catalog.php?datatype=l&amp;amp;did=83" target="_blank"&gt;Yahoo News Annotated Comments Corpus&lt;/a&gt; (YNACC) via our data sharing program, &lt;a href="https://webscope.sandbox.yahoo.com/" target="_blank"&gt;Webscope&lt;/a&gt;. 
The corpus contains 522K comments from 140K comment threads posted in response to online news articles, and contains manual annotations for a subset of 2.4K comment threads and 9.2K comments. The annotations include 6 attributes of individual comments: sentiment, tone, agreement with other commenters, topic of the comment, intended audience, and persuasiveness. The annotations also include 3 attributes of threads: constructiveness, agreeability within the conversation, and type of conversation, i.e., &lt;a href="https://en.wikipedia.org/wiki/Flaming_(Internet)" target="_blank"&gt;flamewars&lt;/a&gt; vs positive/respectful [2].&lt;/p&gt;&lt;p&gt;Annotated conversations in the YNACC corpus were used to create a predictive algorithm and train statistical models to automatically detect “good” conversations. We call these good conversations ERICs: Engaging, Respectful, and/or Informative Conversations, and they are characterized by:&lt;br/&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;A respectful exchange of ideas, opinions, and/or information in response to a given topic or topics.&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Opinions expressed as an attempt to elicit a dialogue or persuade.&lt;br/&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Comments that seek to contribute some new information or perspective on the relevant topic.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/ynacc/YNACC01.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/ynacc/YNACC01.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="360" data-orig-width="670" data-orig-src="https://s.yimg.com/ge/ynacc/YNACC01.png"&gt;&lt;img src="https://66.media.tumblr.com/f27aae404a2edd374a211027a455cb30/tumblr_inline_p7k6h4M2dM1rgj0aw_540.png" alt="image" data-orig-height="360" data-orig-width="670" data-orig-src="https://s.yimg.com/ge/ynacc/YNACC01.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/ynacc/YNACC02.png" 
target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/ynacc/YNACC02.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="360" data-orig-width="670" data-orig-src="https://s.yimg.com/ge/ynacc/YNACC02.png"&gt;&lt;img src="https://66.media.tumblr.com/20de8547adf59086cd08196ce92fea2f/tumblr_inline_p7k6h45p1N1rgj0aw_540.png" alt="image" data-orig-height="360" data-orig-width="670" data-orig-src="https://s.yimg.com/ge/ynacc/YNACC02.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;&lt;/p&gt;&lt;center&gt;Example of an ERIC conversation (top) and a non-ERIC conversation (bottom)&lt;/center&gt;&lt;br/&gt;&lt;p&gt;ERICs have no single identifying attribute. A good conversation is determined by how many respectful, engaging, and persuasive comments are present. For instance, an exchange where participants are in total agreement throughout can be an ERIC, as can an exchange with a heated disagreement. Our algorithm ranks either of these types of exchanges higher than exchanges that are not ERICs. Many of the labels for the ERICs in our dataset are the result of a new coding scheme (annotation taxonomy) we developed, and they describe characteristics of online conversations not captured by traditional argumentation or dialogue features. Some of the labels we collected have been annotated in previous work [3,4], and this is the first time they are aggregated in a single corpus at the dialogue level.&lt;/p&gt;&lt;p&gt;Additionally, we collected annotations on 1K threads from the &lt;a href="http://www.lrec-conf.org/proceedings/lrec2016/pdf/1126_Paper.pdf" target="_blank"&gt;Internet Argument Corpus&lt;/a&gt;, representing another domain of online debates. Together, our corpus and annotation scheme offer the first exploration of how characteristics of individual comments contribute to the dialogue-level classification of an exchange. We hope YNACC will facilitate research to understand ERICs and other aspects of dialogue in general. 
&lt;/p&gt;&lt;p&gt;&lt;b&gt;The technical contributions of this dataset are described in two scientific papers:&lt;/b&gt;&lt;/p&gt;&lt;p&gt;[1] Courtney Napoles, Aasish Pappu, and Joel Tetreault. &lt;a href="https://research.yahoo.com/publications/8882/automatically-identifying-good-conversations-online-yes-they-do-exist" target="_blank"&gt;&amp;ldquo;Automatically Identifying Good Conversations Online (Yes, they do exist!)&amp;rdquo;&lt;/a&gt;. In Proceedings of ICWSM'17.&lt;/p&gt;&lt;p&gt;[2] Courtney Napoles, Joel Tetreault, Aasish Pappu, Enrica Rosato and Brian Provenzale. 2017. &lt;a href="https://research.yahoo.com/publications/8881/finding-good-conversations-online-yahoo-news-annotated-comments-corpus" target="_blank"&gt;“Finding Good Conversations Online: The Yahoo News Annotated Comments Corpus”&lt;/a&gt;. In Proceedings of The 11th Linguistic Annotation Workshop (LAW-XI).&lt;/p&gt;&lt;p&gt;&lt;b&gt;References:&lt;/b&gt;&lt;/p&gt;&lt;p&gt;[3] Rob Abbott, Brian Ecker, Pranav Anand, and Marilyn Walker. Internet Argument Corpus 2.0: An SQL schema for dialogic social media and the corpora to go with it. LREC 2016.&lt;/p&gt;&lt;p&gt;[4] Marilyn Walker, Jean Fox Tree, Pranav Anand, Rob Abbott, and Joseph King. A corpus for research on deliberation and debate. 
LREC 2012.&lt;br/&gt;&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/159942008511</link><guid>https://yahooresearch.tumblr.com/post/159942008511</guid><pubDate>Mon, 24 Apr 2017 09:01:32 -0700</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>Yahoo News</category><category>Comments</category><category>data</category><category>dataset</category></item><item><title>Reinventing Mail Search and Solving the Catch 22 Dilemma Between Precision and Recall</title><description>&lt;p&gt;By &lt;a href="https://research.yahoo.com/researchers/liane" target="_blank"&gt;Liane Lewin-Eytan&lt;/a&gt; and &lt;a href="https://research.yahoo.com/researchers/yoelle" target="_blank"&gt;Yoelle Maarek&lt;/a&gt;&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;i&gt;The Yahoo Mail Mining team in Haifa is publishing two research papers in the proceedings of the &lt;a href="http://www.www2017.com.au/" target="_blank"&gt;26th International World Wide Web Conference&lt;/a&gt; in Perth, Australia this week. These publications, discussed in this blog post, will be presented in a track dedicated to the growing field of email and personal search. We would like to encourage the growth and adoption of research in this field by sharing some of our insights and spurring new ideas. &lt;/i&gt;&lt;/p&gt;&lt;p&gt;How many times have you tried to search for an email and not been able to find it? You might think that getting the right email search result is as easy as getting a good search result on the Web, where you often find what you’re looking for on the first try. Unfortunately, experience tells us otherwise. The reason? Mail search is, perhaps surprisingly, an entirely different animal. At Yahoo Research, we’ve been working hard at changing this situation. &lt;/p&gt;&lt;p&gt;When searching your mailbox, you are typically looking for a message that you have received and most probably read. 
You try to remember the name of the person who sent it to you or some distinctive words in the message. In other words, you try to &lt;i&gt;re-find&lt;/i&gt; a given message, while in Web search, by contrast, you try to &lt;i&gt;discover&lt;/i&gt; new information. In information retrieval, the computer science discipline behind search, this difference is reflected in two measures: &lt;b&gt;recall&lt;/b&gt; (optimally “all the truth” – or in practice, the fraction of true instances that are retrieved), and &lt;b&gt;precision&lt;/b&gt; (optimally “nothing but the truth” – or in practice, the fraction of retrieved instances that are true). A well-known property of these metrics is that improving one typically comes at the expense of the other. Web search targets precision as it draws from a large pool of potentially relevant results, and the searchers do not know and do not care if some relevant results are omitted as long as they get a sufficient answer. By contrast, mail search targets recall, as users know with certainty when the search results miss the messages they want. &lt;/p&gt;&lt;p&gt;Since users want to make sure they do not miss anything when performing a mail search, they expect their results to be sorted by time so as to scan all results in a systematic manner and maintain an illusion of perfect recall. Unfortunately, by doing so, they create a hard challenge for the search mechanism, which is forced to impose strict relevance constraints on the results returned to the user. This is necessary because otherwise, an only remotely relevant yet very recent message could be pushed to the top of the list. In other words, the time-sorted view of results that users expect imposes high precision constraints. This then negatively impacts recall, which is what users really care about. Catch 22! &lt;/p&gt;&lt;p&gt;By analyzing search behavior, we have discovered that mail users don’t use search as precisely as intended. 
We have seen that in 40% of the cases where a user performs a search, they don’t really &lt;i&gt;search&lt;/i&gt;. Instead, they actually &lt;i&gt;browse&lt;/i&gt; by issuing a “contact query” (i.e. a query that simply consists of a contact name) and then scanning the results. Even non-contact queries remain very vague, with an average length of 1.5 search terms. This again highlights the filtering/browsing behavior of Mail users. The two charts below represent search usage stats based on Yahoo Web Mail traffic.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/Mail_Search_Stats_-_1.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/Mail_Search_Stats_-_1.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="434" data-orig-width="1086" data-orig-src="https://s.yimg.com/ge/research/Mail_Search_Stats_-_1.png"&gt;&lt;img src="https://66.media.tumblr.com/14668804dd087f8373eb309c97d95de5/tumblr_inline_p7k6h4QKYn1rgj0aw_540.png" alt="image" data-orig-height="434" data-orig-width="1086" data-orig-src="https://s.yimg.com/ge/research/Mail_Search_Stats_-_1.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;&lt;/p&gt;&lt;center&gt;Figure 1: &lt;a href="https://research.yahoo.com/publications/8655/rank-time-or-relevance-revisiting-email-search" target="_blank"&gt;Mail search stats&lt;/a&gt;&lt;/center&gt;&lt;p&gt;&lt;br/&gt;Given that email is critical to so many people, we feel it is important to make sure all of our 225M monthly active Yahoo Mail users are getting the best experience possible. With our users in mind, the Yahoo Mail Mining Research team has adopted two different approaches to achieve improved precision and recall, making email search more effective. 
The first focuses on search &lt;b&gt;results&lt;/b&gt; and the second on search &lt;b&gt;queries&lt;/b&gt;.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Focusing on the results&lt;/b&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="https://research.yahoo.com/publications/8655/rank-time-or-relevance-revisiting-email-search" target="_blank"&gt;We have developed a first-of-its-kind ranking algorithm&lt;/a&gt; [1] that ranks results by &lt;b&gt;relevance&lt;/b&gt; rather than by time sent or received. That way, users are able to efficiently find the messages they are searching for, even if they are not very recent. This relevance ranking algorithm is based on a varied set of features, taking into account every signal that could imply the relevance of a message to a query. Included in the features are those based on the query-message similarity, the message recency, and the message itself (its links, attachments, etc.). Additionally, our algorithm takes into account actions performed on a message. For example, if you “star” a message, its relevance increases. However, if you haven’t opened it at all, its relevance is lowered.&lt;/p&gt;&lt;p&gt;The table below shows the increase obtained in &lt;a href="https://en.wikipedia.org/wiki/Mean_reciprocal_rank" target="_blank"&gt;mean reciprocal rank&lt;/a&gt; (a.k.a. MRR, one of the most popular metrics in search, based on the average reciprocal rank of the clicked message), comparing relevance ranking to time ranking, as well as the influence of the different sets of features.&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="1036" data-orig-height="546" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/9bb339892892a2162271b518ad538c3d/tumblr_inline_ondhuofaXw1rgj0aw_540.png" alt="image" data-orig-width="1036" data-orig-height="546"/&gt;&lt;/figure&gt;&lt;center&gt;Figure 2: Performance of relevance ranking vs. 
time ranking (lift %)&lt;/center&gt;&lt;p&gt;&lt;br/&gt;Our ranking algorithm can display search results in order of relevance, as opposed to reverse chronological order. Consequently, it greatly relaxes strict match constraints, thereby increasing recall by having the most-relevant results at the top and less relevant results – typically missed altogether in the sort-by-time paradigm – still included down the list.&lt;/p&gt;&lt;p&gt;However, we understand that most users may not immediately be open to this revolutionary shift, and will still expect results ranked by time to reinforce their perception of perfect recall. With this in mind, we &lt;a href="https://yahoomail.tumblr.com/post/158969414236/updated-yahoo-mail-app-brings-you-top-search" target="_blank"&gt;recently released&lt;/a&gt; an intermediate step for presenting search results in which we promote a small number of the most-relevant results on top of the standard time-sorted results. We refer to the top results as &lt;b&gt;heroes&lt;/b&gt;, and the research and algorithm supporting our method are detailed in &lt;a href="https://research.yahoo.com/publications/8870/promoting-relevant-results-time-ranked-mail-search" target="_blank"&gt;our paper&lt;/a&gt; [2] at the &lt;a href="http://www.www2017.com.au/" target="_blank"&gt;26th International World Wide Web Conference&lt;/a&gt; (WWW ‘17) in Perth, Australia. 
Heroes target precision while the traditional ranked-by-time results target recall; having these two types of results coexist allows us to solve our catch 22 dilemma.&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="762" data-orig-height="543" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/b807807ce323326a98a123317f37fca4/tumblr_inline_ondhxgkn2G1rgj0aw_540.png" alt="image" data-orig-width="762" data-orig-height="543"/&gt;&lt;/figure&gt;&lt;center&gt;Figure 3: Heroes (appearing as ‘Top results’ in Yahoo desktop mail)&lt;/center&gt;&lt;p&gt;&lt;br/&gt;The chart below is based on an online evaluation of the implementation of the heroes feature. Our evaluation demonstrates a lift of 12% in MRR in Yahoo Web mail.&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="693" data-orig-height="423" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/d5290e4ab841ecb46bec5388b983e817/tumblr_inline_ondhyomE9O1rgj0aw_540.png" alt="image" data-orig-width="693" data-orig-height="423"/&gt;&lt;/figure&gt;&lt;center&gt;Figure 4: Performance of heroes model vs. traditional time ranking&lt;/center&gt;&lt;p&gt;&lt;br/&gt;&lt;b&gt;Focusing on the queries&lt;/b&gt;&lt;/p&gt;&lt;p&gt;If we want the quality of search results to improve, we first need users to issue more specific queries and stop entering one-word queries followed by browsing. With such underspecified queries, even the best relevance algorithm we can invent will bring only limited benefits. So we need to assist users in formulating queries and bring them the results they seek faster. The existing query-assistance mechanism in Yahoo Mail today does a beautiful job with contact suggestions (through our Xobni technology) and also leverages past queries from a user’s search history. However, this technology is not sufficient to increase the length of queries and make them more precise. 
&lt;/p&gt;&lt;p&gt;To improve this situation, we have developed a novel query auto-completion module that takes as input a few characters or words entered by users (a prefix) and offers a list of suggested queries that are generated from three sources: their query log (search history), the content of their mailbox, and the general query log of all users. Moreover, when generating queries based on the general query log, we take into account various demographic attributes of the user such as gender, age, and location, based on the intuition that queries from “users like me” are more likely to be relevant to my personal mailbox. For example, the query prefix “sch” typed by a 30-year-old professional from New York might refer to “schedule,” while the same prefix typed by a student from San Francisco might refer to “scholarship.” Or, when typing “Be,” the professional might intend “Best Buy,” while the student might intend “Berkeley.” The algorithm we designed to surface suggestions allows users to search in a more precise manner than they otherwise would, thereby helping to exploit the full capacities of the mail search system. &lt;a href="https://research.yahoo.com/publications/8869/demographics-mail-search-and-their-application-query-suggestion" target="_blank"&gt;Our research paper&lt;/a&gt; on query auto-completion [3], including a thorough analysis of the demographics of mail search, will also be published at WWW ‘17.&lt;/p&gt;&lt;p&gt;The combination of all these signals improves the quality of the suggestions by up to 150% when considering the average rank of the clicked suggestion. When splitting the queries according to types (personal contacts, names of companies, general content), the contribution of the general query log is the highest for queries relating to companies/organizations. 
Examples of this include &lt;i&gt;Amazon&lt;/i&gt; or &lt;i&gt;United&lt;/i&gt;, where a user typically searches for their last Amazon purchase or upcoming United Airlines flight. The mail traffic originating from a company is usually formatted very similarly for all users; my Amazon purchase notification will look very similar to yours, and the same goes for our flight itineraries if we both booked with United. Thus, it follows that we probably also search for those items in a similar manner.&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="884" data-orig-height="816" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/d868abc113d1a63313f3282352777631/tumblr_inline_ondi22TM1b1rgj0aw_540.png" alt="image" data-orig-width="884" data-orig-height="816"/&gt;&lt;/figure&gt;&lt;center&gt;Figure 5: Word clouds of discriminative queries. Top: men (left) vs. women (right). Bottom: young (left) vs. senior citizens (right).&lt;/center&gt;&lt;p&gt;&lt;br/&gt;Both our results-focused and query-focused approaches improve search results and serve to encourage users to search better and more often. These improvements are critical for users to trust that they can retrieve the messages they are looking for since they won’t get lost under a pile of less-relevant search results. This is important, because using the search function is the most efficient way to retrieve data; active users have a large amount of important information in their mailbox they want to regularly retrieve, and most of them do not use folders or other means of organizing their messages. In fact, we know that only 30% of our users create folders, and of those people, only 10% actually use them. In other words, for most users, search is the only way to re-find a message. 
&lt;/p&gt;&lt;p&gt;As we move forward to improve this effort, we continue to look at all aspects of mail – both on the &lt;i&gt;algorithmic&lt;/i&gt; side, by analyzing not only messages, but also attachments, links, photos, invites, and so on; and on the &lt;i&gt;user experience&lt;/i&gt; side – to better understand our users’ behaviors and make sure we answer their true, rather than perceived, needs.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Acknowledgements:&lt;/b&gt;&lt;/p&gt;&lt;p&gt;This multiple-year work has been published at several top conferences in the last few years and we are grateful to all our co-authors who not only invested so much effort in this research, but also published about it: &lt;a href="https://research.yahoo.com/researchers/dcarmel" target="_blank"&gt;David Carmel&lt;/a&gt;, &lt;a href="https://research.yahoo.com/researchers/ghalawi" target="_blank"&gt;Guy Halawi&lt;/a&gt;, &lt;a href="https://research.yahoo.com/researchers/alibov" target="_blank"&gt;Alex Libov&lt;/a&gt;, and &lt;a href="https://research.yahoo.com/researchers/arielr" target="_blank"&gt;Ariel Raviv&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;A huge thanks to the entire Yahoo Mail engineering and product team. The list of our friends and colleagues there is too long to be fully listed here but none of this would have happened without their extraordinary support and partnership.&lt;/p&gt;&lt;p&gt;&lt;b&gt;References:&lt;/b&gt;&lt;/p&gt;&lt;p&gt;[1] David Carmel, Guy Halawi, Liane Lewin-Eytan, Yoelle Maarek and Ariel Raviv. &lt;a href="https://research.yahoo.com/publications/8655/rank-time-or-relevance-revisiting-email-search" target="_blank"&gt;Rank by time or by relevance? Revisiting Email Search&lt;/a&gt;. CIKM'2015, Melbourne, Australia, Oct 2015.&lt;/p&gt;&lt;p&gt;[2] David Carmel, Liane Lewin-Eytan, Alexander Libov, Yoelle Maarek and Ariel Raviv. 
&lt;a href="https://research.yahoo.com/publications/8870/promoting-relevant-results-time-ranked-mail-search" target="_blank"&gt;Promoting Relevant Results in Time-Ranked Mail Search&lt;/a&gt;. WWW’2017&lt;/p&gt;&lt;p&gt;[3] David Carmel, Liane Lewin-Eytan, Alexander Libov, Yoelle Maarek and Ariel Raviv. &lt;a href="https://research.yahoo.com/publications/8869/demographics-mail-search-and-their-application-query-suggestion" target="_blank"&gt;The Demographics of Mail Search and their Application to Query Suggestion&lt;/a&gt;. WWW’2017&lt;br/&gt;&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/159206454771</link><guid>https://yahooresearch.tumblr.com/post/159206454771</guid><pubDate>Tue, 04 Apr 2017 16:30:21 -0700</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>Yahoo Mail</category><category>Search</category><category>Mail</category><category>email</category><category>relevance</category><category>WWW2017</category></item><item><title>Adapting to the Evolution of Email with Machine-Generated Mail Mining</title><description>&lt;p&gt;By &lt;a href="https://research.yahoo.com/researchers/liane" target="_blank"&gt;Liane Lewin-Eytan&lt;/a&gt; and &lt;a href="https://research.yahoo.com/researchers/yoelle" target="_blank"&gt;Yoelle Maarek&lt;/a&gt;&lt;br/&gt;&lt;/p&gt;&lt;p&gt;The mail experience has not evolved much in the last few decades as compared to other communication channels. At the same time, personal communications have exploded  with the advent and growth of numerous new communication and social networking apps. This might lead you to believe mail is on its way to a slow death. We beg to differ.&lt;/p&gt;&lt;p&gt;As messaging, video chatting, and other social networking methods have reached adolescence, mail has entered its mid-life (without the crisis) while its traffic has significantly changed and evolved. 
A new type of mail traffic has emerged with the rise of online transactions, including online purchases, financial transactions, travel plans, event notifications, and many others. As a result, the Web mail domain has become dominated by what we call “machine-generated” messages; that is, emails that are generated (usually by companies) via scripts rather than by humans. Following this essential observation, it makes sense that you might want to be able to distinguish traffic generated by machines from that generated by humans. The use cases are numerous: from being able to provide views gathering similar types of messages (personal, travel, purchases, etc.) as surfaced recently in Yahoo Mail (see Figure 1) and Gmail, to being able to provide a user experience tailored to the type of email you are looking at (e.g., you wouldn’t want to provide a “reply” option to a “noreply@” machine-generated address).&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="194" data-orig-height="519"&gt;&lt;img src="https://66.media.tumblr.com/c973d8c0cd721af25e2d45b9aede0dc7/tumblr_inline_ondjwmR3dh1rgj0aw_540.png" alt="image" data-orig-width="194" data-orig-height="519"/&gt;&lt;/figure&gt;&lt;center&gt;Figure 1: Yahoo Mail “Smart Views” – here users can explore emails that are automatically categorized by topic&lt;/center&gt;&lt;p&gt;&lt;br/&gt;At Yahoo Research, we have developed a new classifying technology that distinguishes between human- and machine-generated mail. 
This “Human/Machine” classifier is based on a wide range of features, such as:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;i&gt;sender and traffic characteristics&lt;/i&gt; – a machine can generate large traffic bursts sent to a large number of recipients, while a human cannot&lt;/li&gt;&lt;li&gt;&lt;i&gt;semantic attributes&lt;/i&gt; – various keywords repeating in machine-generated traffic of all types&lt;/li&gt;&lt;li&gt;&lt;i&gt;structural attributes&lt;/i&gt; – messages generated by machines typically have complex HTML structures, while those composed by humans are rather flat&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Our classifier now achieves 90% precision and 90% recall for both categories. This means that 90% of the messages actually composed by human beings are indeed classified as “human,” and out of those classified as “human,” we are correct in 90% of the cases. The same goes for machines.&lt;/p&gt;&lt;p&gt;Given the high degree of accuracy with which we can distinguish between the two types of email traffic, we felt confident launching &lt;a href="https://yahoomail.tumblr.com/post/152342564806/an-inbox-experience-as-unique-as-you-are" target="_blank"&gt;people-only notifications in Yahoo Mail&lt;/a&gt; across all platforms. This new feature (see Figure 2) allows you to get “people-only” notifications. 
In other words, you can turn on the option to receive a notification only when a person emails you, or you can turn it off and receive a notification for any new incoming message.&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="1125" data-orig-height="1999" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/18bb60fcbb881a1e984490c561f6559a/tumblr_inline_ondjwxs8G51rgj0aw_540.png" alt="image" data-orig-width="1125" data-orig-height="1999"/&gt;&lt;/figure&gt;&lt;center&gt;Figure 2: “People only” notifications settings enabled&lt;/center&gt;&lt;p&gt;&lt;br/&gt;While our users have said they enjoy the &amp;ldquo;people only&amp;rdquo; feature, what really excites us on the Yahoo Mail Mining team in Haifa is the opportunity presented by machine-generated emails. We know that machines account for 90% of all mail traffic [4]. These machine-generated messages, whether they are purchase receipts, flight reservations, or something else, contain loads of personal information. So in many ways, the mailbox serves as a personal data store. Unpacking that data in a meaningful way presents an incredible opportunity to advance the mail experience for our users.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Machine-Generated Mail Mining&lt;/b&gt;&lt;/p&gt;&lt;p&gt;This line of research, based on the difference between human- and machine-generated mail traffic, was initiated when we investigated better means of mail classification [4][6] and has served different use cases in recent years [3][5], including mail anonymization [2]. Today, it mainly serves automatic mail extraction [1]. 
With mail extraction, what we attempt to achieve is an automated way of extracting the “personalized” (and thus more meaningful to the user) parts of messages created by automated scripts; more specifically, those parts that are either of high interest to you (the items you purchased, their date of delivery, the details of your trip, etc.), or those that have a business value (e.g., the advertisements we surface). Anonymization techniques that we developed precisely for machine-generated traffic (see Figure 3) allow us to preserve a user’s privacy [2] and adhere to formal PII terms of service.&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="755" data-orig-height="726" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/9edad7e652ceb6de2ceff790a039c79a/tumblr_inline_ondjxcsbMK1rgj0aw_540.jpg" alt="image" data-orig-width="755" data-orig-height="726"/&gt;&lt;/figure&gt;&lt;center&gt;Figure 3: Anonymized mail sample showing extraction fields&lt;/center&gt;&lt;p&gt;&lt;br/&gt;Mail extraction is a process composed of two main phases: &lt;b&gt;clustering&lt;/b&gt; the messages and &lt;b&gt;extracting&lt;/b&gt; the data. Why do we cluster before extracting? Because if clustered correctly, a single extraction rule can be applied over an entire cluster, and therefore needs to be defined only once per cluster.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Clustering&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Clustering mail messages is performed horizontally, exploiting the similarities of machine-generated messages sent en masse with (usually) complex HTML structures. Using the recurrent characteristics of these messages, clusters are created, optimally matching the scripts generating the messages. Previous clustering techniques relied only on the message header [5] and mainly looked for similarities in the messages’ subjects. Today, the state-of-the-art clustering techniques rely on the body of messages (i.e., their structures) [1][2]. 
These techniques detect similarities in the HTML structure of the messages and allow for flexible matching, including some small differences in the structure. More flexible clustering results in fewer clusters and eases processes that require some maintenance or human intervention.&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="876" data-orig-height="835" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/ae67520c4da23746d9daa92e3f40e72f/tumblr_inline_ondjxla6F61rgj0aw_540.png" alt="image" data-orig-width="876" data-orig-height="835"/&gt;&lt;/figure&gt;&lt;center&gt;Figure 4: Cluster of Amazon purchase confirmations&lt;/center&gt;&lt;figure data-orig-width="658" data-orig-height="502" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/4ce6bbfd64653e95ba979156adcc925d/tumblr_inline_ondjxuFC5O1rgj0aw_540.png" alt="image" data-orig-width="658" data-orig-height="502"/&gt;&lt;/figure&gt;&lt;p&gt;&lt;/p&gt;&lt;center&gt;Figure 5: A chart representing the distribution of clusters between different categories&lt;/center&gt;&lt;figure data-orig-width="644" data-orig-height="541" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/7acda67645d2bd564c17511875b1e541/tumblr_inline_ondjy4QMrJ1rgj0aw_540.png" alt="image" data-orig-width="644" data-orig-height="541"/&gt;&lt;/figure&gt;&lt;center&gt;Figure 6: Level of structural flexibility that can be allowed, while still guaranteeing high extraction quality (flexibility represented by edit distance)&lt;/center&gt;&lt;p&gt;&lt;br/&gt;&lt;b&gt;Extractions&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Once the clusters have been created, and we have guaranteed that all messages within a cluster are similar with regard to structure, we can move on to our next phase: mail extraction. Currently at Yahoo, some manual work is involved in defining extraction rules for interpreting some pieces of messages and for validating parts of the process. 
Extraction rules are defined per cluster and are applied online for each message entering the system after identifying the cluster to which it belongs.&lt;/p&gt;&lt;p&gt;The fact that mail extraction requires some human interpretation during intermediate phases of its cycle is a bottleneck, which prevents scalability and coverage of the long tail. A fully-automated method for creating extraction rules that cover all machine-generated traffic is in the advanced stages of development. Our process is based on the similarity of messages guaranteed by the cluster, which is a crucial attribute used for identifying and annotating the pieces of information we want to extract. Figure 7 below is an example of a rule created automatically. It is defined over an &lt;i&gt;XPath&lt;/i&gt; [7], where an XPath is simply a pointer to a specific location in the message. This rule defines the fields of interest to be extracted from this location and provides their full annotations.&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="967" data-orig-height="201" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/f8b19960f55ce911d8db82f09acda7ec/tumblr_inline_ondjyg7V011rgj0aw_540.png" alt="image" data-orig-width="967" data-orig-height="201"/&gt;&lt;/figure&gt;&lt;center&gt;Figure 7: An extraction rule&lt;/center&gt;&lt;p&gt;&lt;br/&gt;As we continue to develop our machine-generated mail mining techniques, we hope to broaden our approach and share this research in an effort to encourage others to do the same. 
As a domain that has significantly changed its nature in recent years, mail deserves a reexamination of its scientific foundation and more attention within the research community.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Acknowledgements:&lt;/b&gt;&lt;/p&gt;&lt;p&gt;This multiple-year work has been published at several top conferences in the last few years and we are grateful to all our co-authors who not only invested so much effort in this research, but also published about it:
Nir Ailon, &lt;a href="https://research.yahoo.com/researchers/noa-avigdor-elgrabli" target="_blank"&gt;Noa Avigdor-Elgrabli&lt;/a&gt;, Marc Cwalinski, &lt;a href="https://research.yahoo.com/researchers/dot" target="_blank"&gt;Dotan DiCastro&lt;/a&gt;, &lt;a href="https://research.yahoo.com/researchers/iftah" target="_blank"&gt;Iftah Gamzu&lt;/a&gt;, &lt;a href="https://research.yahoo.com/researchers/iragz" target="_blank"&gt;Ira Grabovitch-Zuyev&lt;/a&gt;, Mihajlo Grbovic, &lt;a href="https://research.yahoo.com/researchers/ghalawi" target="_blank"&gt;Guy Halawi&lt;/a&gt;, Yehuda Koren, &lt;a href="https://research.yahoo.com/researchers/zkarnin" target="_blank"&gt;Zohar Karnin&lt;/a&gt;, Edo Liberty,  Roman Sandler, David Wajc, &lt;a href="https://research.yahoo.com/researchers/ranw" target="_blank"&gt;Ran Wolff&lt;/a&gt;, and Eyal Zohar.&lt;/p&gt;&lt;p&gt;A huge thanks to the entire Yahoo Mail engineering and product team. The list of our friends and colleagues there is too long to be fully listed here but this couldn&amp;rsquo;t have happened without their extraordinary support and partnership.&lt;/p&gt;&lt;p&gt;&lt;b&gt;References:&lt;/b&gt;&lt;/p&gt;&lt;p&gt;[1] Noa Avigdor-Elgrabli, Mark Cwalinskiy, Dotan Di Castro, Iftah Gamzu, Irena Grabovitch-Zuyev, Liane Lewin-Eytan, Yoelle Maarek. &lt;a href="https://research.yahoo.com/publications/8866/structural-clustering-machine-generated-mail" target="_blank"&gt;&lt;i&gt;Structural Clustering of Machine-Generated Mail&lt;/i&gt;&lt;/a&gt;. CIKM 2016.&lt;/p&gt;&lt;p&gt;[2] Dotan Di Castro, Liane Lewin-Eytan, Yoelle Maarek, Ran Wolff and Eyal Zohar, &lt;a href="https://research.yahoo.com/publications/8614/enforcing-k-anonymity-web-mail-auditing" target="_blank"&gt;&lt;i&gt;Enforcing k-anonymity in Web mail auditing&lt;/i&gt;&lt;/a&gt;. WSDM'2016, San Francisco, CA, Feb 2016.&lt;/p&gt;&lt;p&gt;[3] Iftah Gamzu, Zohar Shay Karnin, Yoelle Maarek and David Wajc. 
&lt;a href="https://research.yahoo.com/publications/6770/you-will-get-mail-predicting-arrival-future-email" target="_blank"&gt;&lt;i&gt;You Will Get Mail! Predicting the Arrival of Future Email&lt;/i&gt;&lt;/a&gt;. TempWeb 2015, Florence, Italy, May 2015.&lt;/p&gt;&lt;p&gt;[4] Mihajlo Grbovic, Guy Halawi, Zohar Karnin and Yoelle Maarek. &lt;a href="https://research.yahoo.com/publications/6714/how-many-folders-do-you-really-need-classifying-email-handful-categories" target="_blank"&gt;&lt;i&gt;How many folders do you really need? Classifying email into a handful of categories&lt;/i&gt;&lt;/a&gt;. CIKM’2014, Shanghai, China, Nov 2014.&lt;/p&gt;&lt;p&gt;[5] Nir Ailon, Zohar Karnin, Edo Liberty and Yoelle Maarek. &lt;a href="https://research.yahoo.com/publications/6551/threading-machine-generated-email" target="_blank"&gt;&lt;i&gt;Threading Machine Generated Email&lt;/i&gt;&lt;/a&gt;. WSDM 2013.&lt;/p&gt;&lt;p&gt;[6] Y. Koren, E. Liberty, Y. Maarek, and R. Sandler. &lt;a href="https://research.yahoo.com/publications/6245/automatically-tagging-email-leveraging-other-users-folders" target="_blank"&gt;&lt;i&gt;Automatically tagging email by leveraging other users&amp;rsquo; folders&lt;/i&gt;&lt;/a&gt;. KDD, 2011.&lt;/p&gt;&lt;p&gt;[7] W3C. XML Path Language (XPath) Version 1.0. 
&lt;br/&gt;&lt;a href="http://www.w3.org/TR/xpath/" target="_blank"&gt;http://www.w3.org/TR/xpath/&lt;/a&gt;, November 1999&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/159206449901</link><guid>https://yahooresearch.tumblr.com/post/159206449901</guid><pubDate>Tue, 04 Apr 2017 16:30:11 -0700</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>Yahoo Mail</category><category>Mail</category><category>email</category><category>Mail Mining</category></item><item><title>Introducing Similarity Search at Flickr</title><description>&lt;p&gt;By Clayton Mellina, Software Development Engineer&lt;br/&gt;&lt;/p&gt;&lt;p&gt;At Yahoo, our Computer Vision team works closely with Flickr, one of the world’s largest photo-sharing communities. The billions of photos hosted by Flickr allow us to tackle some of the most interesting real-world problems in image and video understanding. One of those major problems is that of discovery. We understand that the value in our photo corpus is only unlocked when the community can find photos and photographers that inspire them, so we strive to enable the discovery and appreciation of new photos.&lt;/p&gt;&lt;p&gt;To further that effort, today we are introducing &lt;b&gt;similarity search&lt;/b&gt; on Flickr. If you hover over a photo on a search result page, you will reveal a “&amp;hellip;” button that exposes a menu that gives you the option to search for photos similar to the photo you are currently viewing.&lt;/p&gt;&lt;p&gt;In many ways, photo search is very different from traditional web or text search. First, the goal of web search is usually to satisfy a particular information need, while with photo search the goal is often one of &lt;i&gt;discovery&lt;/i&gt;; as such, it should be delightful as well as functional. We have taken this to heart throughout Flickr. 
For instance, our color search feature, which allows filtering by color scheme, and our style filters, which allow filtering by styles such as “minimalist” or “patterns,” encourage exploration. Second, in traditional web search, the goal is usually to match documents to a set of keywords in the query. That is, the query is in the same modality—text—as the documents being searched. Photo search usually matches &lt;i&gt;across&lt;/i&gt; modalities: text to image. Text querying is a necessary feature of a photo search engine, but, as the saying goes, a picture is worth a thousand words. And beyond saving people the effort of so much typing, many visual concepts genuinely defy accurate description. Now, we’re giving our community a way to easily explore those visual concepts with the “&amp;hellip;” button, a feature we call the &lt;b&gt;similarity pivot&lt;/b&gt;.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/yahooresearch/demo.gif" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/yahooresearch/demo.gif" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="658" data-orig-width="1277" data-orig-src="https://s.yimg.com/ge/yahooresearch/demo.gif"&gt;&lt;img src="https://66.media.tumblr.com/52252f8e8e2763858d29733a14caba97/tumblr_inline_p7k6i2HGzF1rgj0aw_540.gif" alt="image" data-orig-height="658" data-orig-width="1277" data-orig-src="https://s.yimg.com/ge/yahooresearch/demo.gif"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;The similarity pivot is a significant addition to the Flickr experience because it offers our community an entirely new way to explore and discover the billions of incredible photos and millions of incredible photographers on Flickr. 
It allows people to look for &lt;a href="https://www.flickr.com/search/?similarity_id=29327172003" target="_blank"&gt;images of a particular style&lt;/a&gt;, it gives people a view into &lt;a href="https://www.flickr.com/search/?similarity_id=5742058855" target="_blank"&gt;universal behaviors&lt;/a&gt;, and even when it “messes up,” it can force people to look at the &lt;a href="https://www.flickr.com/search/?similarity_id=14198128453" target="_blank"&gt;unexpected&lt;/a&gt; &lt;a href="https://www.flickr.com/search/?similarity_id=28863966765" target="_blank"&gt;commonalities&lt;/a&gt; and &lt;a href="https://www.flickr.com/search/?similarity_id=8002923505" target="_blank"&gt;oddities&lt;/a&gt; of our visual world with a &lt;a href="https://www.flickr.com/search/?similarity_id=13759436465" target="_blank"&gt;fresh&lt;/a&gt; &lt;a href="https://www.flickr.com/search/?similarity_id=15346205045" target="_blank"&gt;perspective&lt;/a&gt;.&lt;/p&gt;&lt;p&gt;&lt;b&gt;What is “similarity?”&lt;/b&gt;&lt;/p&gt;&lt;p&gt;To understand how an experience like this is powered, we first need to understand what we mean by “similarity.” There are many ways photos can be similar to one another. 
Consider some examples.&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="1033" data-orig-height="443" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/b9a461c6958c1cb5c6044970ada8bd01/tumblr_inline_omgc78kW8U1rgj0aw_540.png" alt="image" data-orig-width="1033" data-orig-height="443"/&gt;&lt;/figure&gt;&lt;figure data-orig-width="1030" data-orig-height="442" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/987c659981b347c3c8bfe4e65fa180e2/tumblr_inline_omgc7grWpW1rgj0aw_540.png" alt="image" data-orig-width="1030" data-orig-height="442"/&gt;&lt;/figure&gt;&lt;figure data-orig-width="1029" data-orig-height="443" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/699cf0fc0fb103b251a961671c0cccd2/tumblr_inline_omgc7k95kH1rgj0aw_540.png" alt="image" data-orig-width="1029" data-orig-height="443"/&gt;&lt;/figure&gt;&lt;p&gt;It is apparent that all of these groups of photos illustrate some notion of “similarity,” but each is different. Roughly, they are: similarity of color, similarity of texture, and similarity of semantic category. And there are many others that you might imagine as well.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;What notion of similarity is best suited for a site like Flickr? Ideally, we’d like to be able to capture multiple types of similarity, but we decided early on that semantic similarity—similarity based on the semantic content of the photos—was vital to facilitating discovery on Flickr. This requires a deep understanding of image content, for which we employ deep neural networks.&lt;/p&gt;&lt;p&gt;We have been using deep neural networks at Flickr for a while for various tasks such as object recognition, NSFW prediction, and even prediction of aesthetic quality. 
For these tasks, we train a neural network to map the raw pixels of a photo into a set of relevant tags, as illustrated below.&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="1408" data-orig-height="404" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/58e4dfb98c170f85ec3f33dee7e3c2a3/tumblr_inline_omgcbv2ONL1rgj0aw_540.png" alt="image" data-orig-width="1408" data-orig-height="404"/&gt;&lt;/figure&gt;&lt;p&gt;Internally, the neural network accomplishes this mapping incrementally by applying a series of transformations to the image, which can be thought of as a vector of numbers corresponding to the pixel intensities. Each transformation in the series produces another vector, which is in turn the input to the next transformation, until finally we have a vector that we specifically constrain to be a list of probabilities for each class we are trying to recognize in the image. To be able to go from raw pixels to a semantic label like “hot air balloon,” the network discards lots of information about the image, including information about appearance, such as the color of the balloon, its relative position in the sky, etc. Instead of relying on this final output, we can extract an internal vector from the network before the final output.&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="1244" data-orig-height="712" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/71275e89f435ceea1fc011cf6f2971b2/tumblr_inline_omgcdbU3pA1rgj0aw_540.png" alt="image" data-orig-width="1244" data-orig-height="712"/&gt;&lt;/figure&gt;&lt;p&gt;For common neural network architectures, this vector—which we call a “feature vector”—has many hundreds or thousands of dimensions. We can’t necessarily say with certainty that any one of these dimensions means something in particular as we could at the final network output, whose dimensions correspond to tag probabilities. 
But these vectors have an important property: when you compute the &lt;a href="https://en.wikipedia.org/wiki/Euclidean_distance" target="_blank"&gt;Euclidean distance&lt;/a&gt; between these vectors, images containing similar content will tend to have feature vectors closer together than images containing dissimilar content. You can think of this as a way that the network has learned to organize information present in the image so that it can output the required class prediction. This is exactly what we are looking for: Euclidean distance in this high-dimensional feature space is a measure of semantic similarity. The graphic below illustrates this idea: points in the neighborhood around the query image are semantically similar to the query image, whereas points in neighborhoods farther away are not.&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="1686" data-orig-height="772" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/e62282bcd597aa65f4e5a134bfc093ff/tumblr_inline_omgceyWZUE1rgj0aw_540.png" alt="image" data-orig-width="1686" data-orig-height="772"/&gt;&lt;/figure&gt;&lt;p&gt;This measure of similarity is not perfect and cannot capture all possible notions of similarity—it will be constrained by the particular task the network was trained to perform, i.e., scene recognition. However, it is effective for our purposes, and, importantly, it contains information beyond merely the semantic content of the image, such as appearance, composition, and texture. Most importantly, it gives us a simple algorithm for finding visually similar photos: compute the distance in the feature space from a query image to each index image and return the images with the lowest distance. 
Of course, there is much more work to do to make this idea work for billions of images.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;b&gt;Large-scale approximate nearest neighbor search&lt;/b&gt;&lt;/p&gt;&lt;p&gt;With an index as large as Flickr’s, computing distances exhaustively for each query is intractable. Additionally, storing a high-dimensional floating point feature vector for each of billions of images takes a large amount of disk space and poses even more difficulty if these features need to be in memory for fast ranking. To solve these two issues, we adopt a state-of-the-art approximate nearest neighbor algorithm called &lt;a href="http://image.ntua.gr/iva/files/lopq.pdf" target="_blank"&gt;Locally Optimized Product Quantization&lt;/a&gt; (LOPQ).&lt;/p&gt;&lt;p&gt;To understand LOPQ, it is useful to first look at a simple strategy. Rather than ranking all vectors in the index, we can first filter a set of good candidates and only do expensive distance computations on them. For example, we can use an algorithm like &lt;a href="https://en.wikipedia.org/wiki/K-means_clustering" target="_blank"&gt;&lt;i&gt;k&lt;/i&gt;-means&lt;/a&gt; to cluster our index vectors, find the cluster to which each vector is assigned, and index the corresponding cluster id for each vector. At query time, we find the cluster that the query vector is assigned to and fetch the items that belong to the same cluster from the index. We can even expand this set if we like by fetching items from the next nearest cluster.&lt;/p&gt;&lt;p&gt;This idea will take us far, but not far enough for a billion-scale index. For example, with 1 billion photos, we need 1 million clusters so that each cluster contains an average of 1000 photos. At query time, we will have to compute the distance from the query to each of these 1 million cluster centroids in order to find the nearest clusters. This is quite a lot. 
We can do better, however, if we instead split our vectors in half by dimension and cluster each half separately. In this scheme, each vector will be assigned to a pair of cluster ids, one for each half of the vector. If we choose k = 1000 to cluster both halves, we have k&lt;sup&gt;2&lt;/sup&gt; = 1000 * 1000 = 1e6 possible pairs. In other words, by clustering each half separately and assigning each item a pair of cluster ids, we can get the same granularity of partitioning (1 million clusters total) with only 2 * 1000 distance computations, each over half the number of dimensions, for a total computational savings of 1000x. Conversely, for the same computational cost, we gain a factor of k more partitions of the data space, providing a much finer-grained index.&lt;/p&gt;&lt;p&gt;This idea of splitting vectors into subvectors and clustering each split separately is called &lt;a href="https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf" target="_blank"&gt;&lt;i&gt;product quantization&lt;/i&gt;&lt;/a&gt;. When we use this idea to index a dataset, it is called the &lt;a href="http://cache-ash04.cdn.yandex.net/download.yandex.ru/company/cvpr2012.pdf" target="_blank"&gt;&lt;i&gt;inverted multi-index&lt;/i&gt;&lt;/a&gt;, and it forms the basis for fast candidate retrieval in our similarity index. Typically the distribution of points over the clusters in a multi-index will be unbalanced as compared to a standard k-means index, but this imbalance is a fair trade for the much higher resolution partitioning that it buys us. In fact, a multi-index will only be balanced across clusters if the two halves of the vectors are perfectly statistically independent. 
This is not the case in most real-world data, but some heuristic preprocessing—like &lt;a href="https://en.wikipedia.org/wiki/Principal_component_analysis" target="_blank"&gt;PCA-ing&lt;/a&gt; and permuting the dimensions so that the cumulative per-dimension variance is approximately balanced between the halves—helps in many cases. And just like the simple k-means index, there is a fast algorithm for finding a ranked list of the clusters nearest to a query if we need to expand the candidate set.&lt;/p&gt;&lt;p&gt;After we have a set of candidates, we must rank them. We could store the full vector in the index and use it to compute the distance for each candidate item, but this would incur a large memory overhead (for example, 256-dimensional vectors of 4-byte floats would require 1 TB for 1 billion photos) as well as a computational overhead. LOPQ solves these issues by performing another product quantization, this time on the &lt;i&gt;residuals&lt;/i&gt; of the data. The residual of a point is the difference vector between the point and its closest cluster centroid. Given a residual vector and the cluster indexes along with the corresponding centroids, we have enough information to reproduce the original vector exactly. Instead of storing the residuals, LOPQ product quantizes the residuals, usually with a higher number of splits, and stores only the cluster indexes in the index. For example, if we split the vector into 8 splits and each split is clustered with 256 centroids, we can store the compressed vector with only 8 bytes regardless of the number of dimensions to start (though certainly a higher number of dimensions will result in higher approximation error). 
With this &lt;a href="https://en.wikipedia.org/wiki/Lossy_compression" target="_blank"&gt;lossy representation&lt;/a&gt; we can produce a reconstruction of a vector from the 8-byte codes: we simply take each quantization code, look up the corresponding centroid, and concatenate these 8 centroids together to produce a reconstruction. Likewise, we can approximate the distance from the query to an index vector by computing the distance between the query and the reconstruction. We can do this computation quickly for many candidate points by first computing a table of the squared differences between each split of the query and all of the centroids for that split. After computing this table, we can compute the squared difference for an index point by looking up the precomputed squared difference for each of the 8 indexes and summing them together to get the total squared difference. This caching trick allows us to quickly rank many candidates without resorting to distance computations in the original vector space.&lt;/p&gt;&lt;p&gt;LOPQ adds one final detail: for each cluster in the multi-index, LOPQ fits a local rotation to the residuals of the points that fall in that cluster. This rotation is simply a PCA that aligns the major directions of variation in the data to the axes, followed by a permutation to heuristically balance the variance across the splits of the product quantization. Note that this is the exact preprocessing step that is usually performed at the top-level multi-index. It tends to make the approximate distance computations more accurate by mitigating errors introduced by assuming that each split of the vector in the product quantization is statistically independent from other splits. Additionally, since a rotation is fit for each cluster, the rotations serve to fit the local data distribution better.&lt;/p&gt;&lt;p&gt;Below is a diagram from the LOPQ paper that illustrates the core ideas of LOPQ. 
K-means (a) is very effective at allocating cluster centroids, illustrated as red points, that target the distribution of the data, but it has other drawbacks at scale as discussed earlier. In the 2D example shown, we can imagine product quantizing the space with 2 splits, each with 1 dimension. Product Quantization (b) clusters each dimension independently and cluster centroids are specified by pairs of cluster indexes, one for each split. This is effectively a grid over the space. Since the splits are treated as if they were statistically independent, we will, unfortunately, get many clusters that are “wasted” by not targeting the data distribution. We can improve on this situation by rotating the data such that the main dimensions of variation are axis-aligned. This version, called Optimized Product Quantization (c), does a better job of making sure each centroid is useful. LOPQ (d) extends this idea by first coarsely clustering the data and then doing a separate instance of OPQ for each cluster, allowing highly targeted centroids while still reaping the benefits of product quantization in terms of scalability.&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="1022" data-orig-height="880" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/5527f0056a2b03c1553c68133b7b39a7/tumblr_inline_omgd23nKEF1rgj0aw_540.png" alt="image" data-orig-width="1022" data-orig-height="880"/&gt;&lt;/figure&gt;&lt;p&gt;LOPQ is state-of-the-art for quantization methods, and you can find more information about the algorithm, as well as benchmarks, &lt;a href="http://image.ntua.gr/iva/research/lopq/" target="_blank"&gt;here&lt;/a&gt;. Additionally, we provide an &lt;a href="https://github.com/yahoo/lopq" target="_blank"&gt;open-source implementation&lt;/a&gt; in Python and Spark which you can apply to your own datasets. The algorithm produces a set of cluster indexes that can be queried efficiently in an inverted index, as described. 
We have also explored use cases that use these indexes as a hash for fast deduplication of images and large-scale clustering. These extended use cases are studied &lt;a href="https://arxiv.org/abs/1604.06480" target="_blank"&gt;here&lt;/a&gt;.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;/p&gt;&lt;p&gt;We have described our system for large-scale visual similarity search at Flickr. Techniques for producing high-quality vector representations for images with deep learning are constantly improving, enabling new ways to search and explore large multimedia collections. These techniques are being applied in other domains as well to, for example, produce vector representations for &lt;a href="https://en.wikipedia.org/wiki/Word2vec" target="_blank"&gt;text&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/1502.04681.pdf" target="_blank"&gt;video&lt;/a&gt;, and even &lt;a href="https://arxiv.org/pdf/1603.00856.pdf" target="_blank"&gt;molecules&lt;/a&gt;. Large-scale approximate nearest neighbor search has importance and potential application in these domains as well as many others. Though these techniques are in their infancy, we hope similarity search provides a useful new way to appreciate the amazing collection of images at Flickr and surface photos of interest that may have previously gone undiscovered. We are excited about the future of this technology at Flickr and beyond.&lt;/p&gt;&lt;p&gt;&lt;b&gt;&lt;i&gt;Acknowledgements&lt;/i&gt;&lt;/b&gt;&lt;/p&gt;&lt;p&gt;&lt;i&gt;Yannis Kalantidis, Huy Nguyen, Stacey Svetlichnaya, Arel Cordero. 
Special thanks to the rest of the Computer Vision and Machine Learning team and the Vespa search team, which manages Yahoo’s internal search engine.&lt;/i&gt;&lt;br/&gt;&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/158115871236</link><guid>https://yahooresearch.tumblr.com/post/158115871236</guid><pubDate>Tue, 07 Mar 2017 10:00:01 -0800</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>Flickr</category><category>Computer Vision</category><category>Similarity</category><category>Search</category><category>Deep Learning</category><category>Similarity Search</category></item><item><title>Researching the Future of Automated Question-Answering</title><description>&lt;p&gt;By &lt;a href="https://research.yahoo.com/researchers/dpelleg" target="_blank"&gt;Dan Pelleg&lt;/a&gt;, Yuval Pinter, and &lt;a href="https://research.yahoo.com/researchers/dcarmel" target="_blank"&gt;David Carmel&lt;/a&gt;&lt;b&gt;&lt;br/&gt;&lt;/b&gt;&lt;/p&gt;&lt;p&gt;There’s no denying it: chat bots are en vogue, and it seems like everyone is experimenting with the technology. Facebook &lt;a href="http://newsroom.fb.com/news/2016/04/messenger-platform-at-f8/" target="_blank"&gt;introduced bots&lt;/a&gt; for their Messenger Platform, Microsoft &lt;a href="https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/#sm.001kb9j1i11gbeozuxn1kmckukq6f" target="_blank"&gt;launched a controversial chatbot&lt;/a&gt; on Twitter called Tay, and at Yahoo, we’ve &lt;a href="https://yahoo.tumblr.com/post/153480238449/yahoo-brings-on-the-fun-bots-trivia-with-friends" target="_blank"&gt;released our own bots on Kik and Messenger&lt;/a&gt;. With so much interest, we’re doing our part as research scientists to advance the state-of-the-art in question-answering, which, among other things, will lead to conversational bots appearing more human. 
&lt;/p&gt;&lt;p&gt;For the second year in a row, the Yahoo Research Text Mining team in Haifa, in collaboration with Emory University and the U.S. National Institute of Standards and Technology, ran a shared challenge for question-answering called LiveQA as part of the annual &lt;a href="http://trec.nist.gov/" target="_blank"&gt;Text REtrieval Conference&lt;/a&gt; (TREC). &lt;/p&gt;&lt;p&gt;During the 24-hour duration of the competition, 14 teams from the USA, China, Germany, Australia, Canada, Israel, and Qatar participated, though preparations for the competition actually started a few months earlier. Each team developed a Web service that, as input, received a free-form question, and then responded with an answer. The answer could not be longer than 1000 characters, and needed to be returned within one minute. The questions used, from Yahoo Answers, were different from the factoid questions used in some previous TREC QA tracks. Yahoo Answers questions are much more diverse and cover multiple question types, such as opinion, advice, and polls. This made the task far more realistic and challenging.&lt;br/&gt;&lt;/p&gt;&lt;figure data-orig-width="745" data-orig-height="549" class="tmblr-full"&gt;&lt;img src="https://66.media.tumblr.com/70e9c40a80e4224b567e5a44a4fc416b/tumblr_inline_oii2ayIBIW1rgj0aw_540.png" data-orig-width="745" data-orig-height="549"/&gt;&lt;/figure&gt;&lt;p&gt;During the competition day, the TREC participants received a total of 1,088 questions and sent back a total of over 21,000 answers. They usually had no issue with either the time limit (24 seconds to answer, on average) or the length limit (599 characters per answer, on average).&lt;/p&gt;&lt;p&gt;The answers were judged manually on a four-point scale (excellent/good/fair/poor). The best system in terms of performance was the “EmoryCrowd” system from &lt;a href="http://ir.mathcs.emory.edu/" target="_blank"&gt;Emory&lt;/a&gt;. 
This was a unique system, because it used a combination of computational and human labor. First, it computed candidate answers algorithmically, and then it turned to crowdsourced work to rank them – all in less than 60 seconds. This “cyborg” system provided the highest-quality answers by a noticeable margin, trailed by fully-algorithmic systems from CMU, Emory, and Yahoo Research. &lt;/p&gt;&lt;p&gt;Human intellect was shown to be superior not just within hybrid systems, but also by itself, when we had the judges also evaluate the answers given by users on the original Yahoo Answers site. These answers were significantly better than those of any automated or hybrid system. Importantly, though, the Yahoo Answers users had no time or space limit – if they wanted to, they could work on their answers for up to one week, which would give them an obvious advantage. This is just one reason to keep pushing the envelope on automatic question answering. Computers are faster than people for many tasks, they are available 24/7 (if the power is on), and they scale more readily.&lt;/p&gt;&lt;p&gt;Compared to last year’s challenge, the quality of results (i.e., the answers) improved significantly. However, the quality was still below that of the human responders. We also introduced a new task for question summarization, and this task is far from being solved. Our plan is to run the LiveQA challenge next year, thus allowing the participants to further improve and extend their systems. We hope that additional teams will join this joint research effort of answering real users’ questions in real-time, with the goal of encouraging progress in the field of Natural Language Processing as it relates to question-answering. 
Who knows, in the future, maybe our learnings will find their way to a bot near you!&lt;br/&gt;&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/154732225051</link><guid>https://yahooresearch.tumblr.com/post/154732225051</guid><pubDate>Tue, 20 Dec 2016 12:13:09 -0800</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>Yahoo Israel</category><category>Research</category><category>Haifa</category><category>bots</category><category>chat bots</category><category>Q and A</category><category>NLP</category></item><item><title>Presenting an Open Source Toolkit for Lightweight Multilingual Entity Linking</title><description>&lt;p&gt;By &lt;a href="https://research.yahoo.com/researchers/aasishkp" target="_blank"&gt;Aasish Pappu&lt;/a&gt;, Roi Blanco, and Amanda Stent&lt;/p&gt;&lt;p&gt;What’s the first thing you want to know about any kind of text document (like a Yahoo News or Yahoo Sports article)? What it’s about, of course! That means you want to know something about the people, organizations, and locations that are mentioned in the document. Systems that automatically surface this information are called &lt;i&gt;named entity recognition and linking&lt;/i&gt; systems. These are one of the most useful components in text analytics as they are required for a wide variety of applications including search, recommender systems, question answering, and sentiment analysis.&lt;/p&gt;&lt;p&gt;Named entity recognition and linking systems use statistical models trained over large amounts of labeled text data. 
A major challenge is to be able to accurately detect entities, in new languages, at scale, with limited labeled data available, and while consuming a limited amount of resources (memory and processing power).&lt;/p&gt;&lt;p&gt;After researching and implementing solutions to enhance our own personalization technology, we are pleased to offer the open source community &lt;a href="https://github.com/yahoo/FEL" target="_blank"&gt;Fast Entity Linker&lt;/a&gt;, our unsupervised, accurate, and extensible multilingual named entity recognition and linking system, along with &lt;a href="http://webscope.sandbox.yahoo.com/catalog.php?datatype=l&amp;amp;did=81" target="_blank"&gt; datapacks&lt;/a&gt; for English, Spanish, and Chinese.&lt;/p&gt;&lt;p&gt;For broad usability, our system links text entity mentions to Wikipedia. For example, in the sentence &lt;i&gt;Yahoo is a company headquartered in Sunnyvale, CA with Marissa Mayer as CEO&lt;/i&gt;, our system would identify the following entities:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;i&gt;Yahoo&lt;/i&gt; – linked to &lt;a href="https://en.wikipedia.org/wiki/Yahoo" target="_blank"&gt;https://en.wikipedia.org/wiki/Yahoo!&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Sunnyvale, CA&lt;/i&gt; – linked to &lt;a href="https://en.wikipedia.org/wiki/Sunnyvale,_California" target="_blank"&gt;https://en.wikipedia.org/wiki/Sunnyvale,_California&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Marissa Mayer&lt;/i&gt; – linked to &lt;a href="https://en.wikipedia.org/wiki/Marissa_Mayer" target="_blank"&gt;https://en.wikipedia.org/wiki/Marissa_Mayer&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;On the algorithmic side, we use entity embeddings, click-log data, and efficient clustering methods to achieve high precision. The system achieves a low memory footprint and fast execution times by using compressed data structures and aggressive hashing functions.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Entity embeddings&lt;/b&gt; are vector-based representations that capture how entities are referred to in context. We train entity embeddings using Wikipedia articles, and use hyperlinked terms in the articles to create canonical entities. The context of an entity and the context of a token are modeled using the neural network architecture in the figure below, where entity vectors are trained to predict not only their surrounding entities but also the global context of word sequences contained within them. In this way, one layer models entity context, and the other layer models token context. We connect these two layers using the same technique that &lt;font face="Courier New"&gt;&lt;a href="http://www.jmlr.org/proceedings/papers/v32/le14.pdf" target="_blank"&gt;(Le and Mikolov ‘14)&lt;/a&gt;&lt;/font&gt; used to train paragraph vectors.&lt;/p&gt;&lt;figure class="tmblr-full" data-orig-height="204" data-orig-width="540" data-orig-src="https://66.media.tumblr.com/add772d0f1bc01afff7d39750be15ef5/tumblr_inline_ohr4eqjBeU1rgj0aw_540.png"&gt;&lt;img src="https://66.media.tumblr.com/c18ab3e48da5c28fbba9c101f4dd0c4a/tumblr_inline_p7k6i2ybu01rgj0aw_540.png" alt="image" data-orig-height="204" data-orig-width="540" data-orig-src="https://66.media.tumblr.com/add772d0f1bc01afff7d39750be15ef5/tumblr_inline_ohr4eqjBeU1rgj0aw_540.png"/&gt;&lt;/figure&gt;&lt;p&gt;&lt;/p&gt;&lt;center&gt;&lt;i&gt;Architecture for training word embeddings and entity embeddings simultaneously.&lt;/i&gt; Ent &lt;i&gt;represents entities and&lt;/i&gt; W &lt;i&gt;represents their context words.&lt;/i&gt;&lt;/center&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;b&gt;Search click-log data&lt;/b&gt; gives 
very useful signals for disambiguating partial or ambiguous entity mentions. For example, if searchers for “Fox” tend to click on “Fox News” rather than “20th Century Fox,” we can use this data to resolve a mention of “Fox” in a document. To disambiguate entity mentions and ensure a document has a consistent set of entities, our system supports three entity disambiguation algorithms:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://ieeexplore.ieee.org/document/150435/" target="_blank"&gt;Forward Backward Algorithm &lt;font face="Courier New"&gt;(Austin et al. ‘91)&lt;/font&gt;&lt;/a&gt;*&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ieeexplore.ieee.org/document/4408853/" target="_blank"&gt;Exemplar Clustering &lt;font face="Courier New"&gt;(Frey and Dueck ‘07)&lt;/font&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://link.springer.com/chapter/10.1007%252F978-3-642-04174-7_29" target="_blank"&gt;Label Propagation &lt;font face="Courier New"&gt;(Talukdar and Crammer ‘09)&lt;/font&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;&lt;i&gt;*Currently, only the Forward Backward Algorithm is available in our open source release–the other two will be made available soon!&lt;/i&gt;&lt;/p&gt;&lt;p&gt;These algorithms are particularly helpful in accurately linking entities when a popular candidate is NOT the correct candidate for an entity mention. In the example below, these algorithms leverage the surrounding context to accurately link &lt;font face="Courier New"&gt;Manchester City, Swansea City, Liverpool, Chelsea, and Arsenal&lt;/font&gt; to their respective football clubs.&lt;/p&gt;&lt;figure class="tmblr-full" data-orig-height="364" data-orig-width="540" data-orig-src="https://66.media.tumblr.com/fae79977ac1c1fb9b8b819660fa44cb8/tumblr_inline_ohr4kymozw1rgj0aw_540.png"&gt;&lt;img src="https://66.media.tumblr.com/a5af81a0a03c7e1a86ac33f06fba68e2/tumblr_inline_p7k6i2WbbB1rgj0aw_540.png" alt="" data-orig-height="364" data-orig-width="540" data-orig-src="https://66.media.tumblr.com/fae79977ac1c1fb9b8b819660fa44cb8/tumblr_inline_ohr4kymozw1rgj0aw_540.png"/&gt;&lt;/figure&gt;&lt;p&gt;&lt;/p&gt;&lt;center&gt;&lt;i&gt;Ambiguous mentions that could refer to multiple entities are highlighted in red. For example,&lt;/i&gt; Chelsea &lt;i&gt;could refer to Chelsea Football team or Chelsea neighborhood in New York or London. 
Unambiguous named entities are highlighted in green.&lt;/i&gt;&lt;/center&gt;&lt;p&gt;&lt;/p&gt;&lt;figure class="tmblr-full" data-orig-height="351" data-orig-width="540" data-orig-src="https://66.media.tumblr.com/226ad286968f3559e6b418a0e26604b4/tumblr_inline_ohr4ltWUzP1rgj0aw_540.png"&gt;&lt;img src="https://66.media.tumblr.com/0634faada22ea672d2cd86fe0e6dd678/tumblr_inline_p7k6i2lLKL1rgj0aw_540.png" alt="" data-orig-height="351" data-orig-width="540" data-orig-src="https://66.media.tumblr.com/226ad286968f3559e6b418a0e26604b4/tumblr_inline_ohr4ltWUzP1rgj0aw_540.png"/&gt;&lt;/figure&gt;&lt;p&gt;&lt;/p&gt;&lt;center&gt;&lt;i&gt;Examples of the candidate retrieval process in Entity Linking for the ambiguous and unambiguous mentions referred to in the example above. The correct candidate is highlighted in green.&lt;/i&gt;&lt;/center&gt;&lt;br/&gt;At this time, Fast Entity Linker is one of only three freely available multilingual named entity recognition and linking systems (the others are DBpedia Spotlight and Babelfy). In addition to a stand-alone entity linker, the software includes tools for creating and compressing word/entity embeddings and datapacks for different languages from Wikipedia data. 
As an example, the datapack containing information from all of English Wikipedia is only ~2GB.&lt;p&gt;&lt;br/&gt;The technical contributions of this system are described in two scientific papers:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Roi Blanco, Giuseppe Ottaviano, and Edgar Meij: “&lt;a href="https://research.yahoo.com/publications/6735/fast-and-space-efficient-entity-linking-queries" target="_blank"&gt;Fast and space-efficient entity linking in queries.&lt;/a&gt;” In Proceedings of WSDM 2015.&lt;/li&gt;&lt;li&gt;Aasish Pappu, Roi Blanco, Yashar Mehdad, Amanda Stent, and Kapil Thadani: “&lt;a href="https://research.yahoo.com/publications/8810/lightweight-multilingual-entity-extraction-and-linking" target="_blank"&gt;Lightweight multilingual entity extraction and linking.&lt;/a&gt;” In Proceedings of WSDM 2017.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;There are numerous possible applications of the open-source toolkit. One of them is attributing sentiment to entities detected in the text, as opposed to the entire text itself. For example, consider the following actual review of the movie “Inferno” from a user on &lt;a href="http://www.metacritic.com/movie/inferno" target="_blank"&gt;MetaCritic&lt;/a&gt; (revised for clarity): &lt;i&gt;“While the great performance of&lt;/i&gt; Tom Hanks (wiki_Tom_Hanks) &lt;i&gt;and company make for a mysterious and vivid movie, the plot is difficult to comprehend. Although the movie was a clever and fun ride, I expected more from&lt;/i&gt; Columbia (wiki_Columbia_Pictures)&lt;i&gt;.”&lt;/i&gt;  Though the review on balance is neutral, it conveys a positive sentiment about wiki_Tom_Hanks and a negative sentiment about wiki_Columbia_Pictures.&lt;/p&gt;&lt;p&gt;Many existing sentiment analysis tools collate the sentiment value associated with the text as a whole, which makes it difficult to track sentiment around any individual entity. 
With our toolkit, one could automatically extract “positive” and “negative” aspects within a given text, giving a clearer understanding of the sentiment surrounding its individual components.&lt;/p&gt;&lt;p&gt;Feel free to use the code, contribute to it, and come up with additional applications; our system and models are available at &lt;a href="https://github.com/yahoo/FEL" target="_blank"&gt;https://github.com/yahoo/FEL&lt;/a&gt;.&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/154110423951</link><guid>https://yahooresearch.tumblr.com/post/154110423951</guid><pubDate>Mon, 05 Dec 2016 22:53:21 -0800</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>Open Source</category><category>GitHub</category><category>Research</category><category>Entity Linking</category></item><item><title>10 Years of Hadoop and its Israeli Pioneering Researchers</title><description>&lt;p&gt;&lt;a href="https://yahooisrael.tumblr.com/post/153336432846/10-years-of-hadoop-and-its-israeli-pioneering" class="tumblr_blog" target="_blank"&gt;yahooisrael&lt;/a&gt;:&lt;/p&gt;&lt;blockquote&gt;
&lt;p&gt;&lt;figure class="tmblr-full" data-orig-height="360" data-orig-width="540" data-orig-src="https://66.media.tumblr.com/42317c9a76e605b2368b35830c9c159f/tumblr_inline_ogt7giqQoR1rgj0aw_540.jpg"&gt;&lt;img src="https://66.media.tumblr.com/1e3516b830ce3ee4526a04f7229eabbd/tumblr_inline_ogtvyotQv01rgj0aw_540.jpg" data-orig-height="360" data-orig-width="540" data-orig-src="https://66.media.tumblr.com/42317c9a76e605b2368b35830c9c159f/tumblr_inline_ogt7giqQoR1rgj0aw_540.jpg"/&gt;&lt;/figure&gt;&lt;/p&gt;
&lt;p&gt;By &lt;a href="https://research.yahoo.com/researchers/ebortnik" target="_blank"&gt;Edward Bortnikov&lt;/a&gt;&lt;b&gt;&lt;br/&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;The Apache Hadoop technology suite is the engine behind the Big Data revolution that has been transforming multiple industries over the last decade. Hadoop was born at Yahoo 10 years ago as a pioneering open-source project. It quickly outgrew the company’s boundaries to become a vehicle that powers thousands of businesses ranging from small enterprises to Web giants. &lt;/p&gt;
&lt;p&gt;These days, Yahoo runs the largest Hadoop deployment in the industry. We operate tens of thousands of Hadoop machines in our datacenters and manage more than 600 petabytes of data. Our products use Hadoop in a variety of ways that reflect a wealth of data processing patterns. Yahoo’s infrastructure harnesses Hadoop Distributed File System (HDFS) for ultra-scalable storage, Hadoop MapReduce for massive ad-hoc batch processing, Hive and Pig for database-style analytics, HBase for key-value storage, Storm for stream processing, and Zookeeper for reliable coordination. &lt;/p&gt;
&lt;p&gt;Yahoo’s commitment to Hadoop goes far beyond operating the technology at Web scale. The company’s engineers and scientists make contributions to both entrenched and incubating Hadoop projects. Our Scalable Platforms team at Yahoo Research in Haifa has championed multiple innovative efforts that have benefited Yahoo products as well as the entire Hadoop community. Just recently, we contributed new algorithms to HBase, &lt;a href="http://yahoohadoop.tumblr.com/post/129089878751/introducing-omid-transaction-processing-for" target="_blank"&gt;Omid&lt;/a&gt; (a transaction processing system for HBase), and Zookeeper. Our work significantly boosted the performance of these systems and hardened their fault-tolerance. For example, the enhancements to Omid were instrumental in turning it into an &lt;a href="http://incubator.apache.org/projects/omid.html" target="_blank"&gt;Apache Incubation project&lt;/a&gt; (a candidate for top-level technology status), whereas the work in HBase was named one of its top new features this year. &lt;/p&gt;
&lt;p&gt;Our team launched approximately three years ago. Collectively we add many years of experience to Yahoo and the Hadoop community in distributed computing research and development. We specialize in scalability and high availability, arguably the biggest challenges in big data platforms. We love to identify hard problems in large-scale systems, design algorithms to solve them, develop the code, experiment with it, and finally contribute to the community. The team features researchers with deep theoretical backgrounds as well as the engineering maturity required to deal with complex production code. Our researchers regularly present their innovations at leading industrial conferences (Hadoop Summit and HBaseCon), as well as at top academic venues. &lt;/p&gt;
&lt;p&gt;The researchers in our team comprise a blend of backgrounds in distributed computing, programming languages, and big systems, and most of us hold PhD degrees in these areas. We are especially proud to be a pioneering team of Hadoop developers in Israel. As such, we teach courses in big data technologies, organize technical meetups, and collaborate with academic colleagues. We are always happy to share our expertise with the ever-growing community of Hadoop users in the local hi-tech industry. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Our Scalable Platforms team in Haifa has been making significant contributions to the local research and engineering communities!&lt;br/&gt;&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/153394177336</link><guid>https://yahooresearch.tumblr.com/post/153394177336</guid><pubDate>Sat, 19 Nov 2016 11:31:14 -0800</pubDate><category>Yahoo</category><category>Yahoo Israel</category><category>Yahoo Research</category><category>research</category><category>Yahoo Hadoop</category><category>hadoop</category><category>engineering</category><category>Israel</category><category>Haifa</category></item><item><title>Open Sourcing a Deep Learning Solution for Detecting NSFW Images</title><description>&lt;p&gt;&lt;a class="tumblr_blog" href="http://yahooeng.tumblr.com/post/151148689421" target="_blank"&gt;yahooeng&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;By Jay Mahadeokar and Gerry Pesavento&lt;/p&gt;
&lt;p&gt;Automatically identifying that an image is not suitable/safe for work (NSFW), including offensive and adult images, is an important problem which researchers have been trying to tackle for decades. Since images and user-generated content dominate the Internet today, filtering NSFW images becomes an essential component of Web and mobile applications. With the evolution of computer vision, improved training data, and deep learning algorithms, computers are now able to automatically classify NSFW image content with greater precision.&lt;/p&gt;
&lt;p&gt;Defining NSFW material is subjective and the task of identifying these images is non-trivial. Moreover, what may be objectionable in one context can be suitable in another. For this reason, the model we describe below focuses only on one type of NSFW content: pornographic images. The identification of NSFW sketches, cartoons, text, images of graphic violence, or other types of unsuitable content is not addressed with this model.&lt;/p&gt;
&lt;p&gt;To the best of our knowledge, there is no open source model or algorithm for identifying NSFW images. In the spirit of collaboration and with the hope of advancing this endeavor, we are releasing our deep learning model that will allow developers to experiment with a classifier for NSFW detection, and provide feedback to us on ways to improve the classifier.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/yahoo/open_nsfw" target="_blank"&gt;Our general purpose Caffe deep neural network model (GitHub code)&lt;/a&gt; takes an image as input and outputs a probability (i.e., a score between 0 and 1) that can be used to detect and filter NSFW images. Developers can use this score to filter images against a threshold chosen from a &lt;a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic" target="_blank"&gt;ROC&lt;/a&gt; curve for their specific use case, or use it as a signal to rank images in search results.&lt;/p&gt;
&lt;figure class="tmblr-full" data-orig-height="427" data-orig-width="1899"&gt;&lt;img src="https://66.media.tumblr.com/a24135a56ecf20d7efb81dda0f4ccbac/tumblr_inline_oebl0iNWRM1rilvr1_540.png" data-orig-height="427" data-orig-width="1899"/&gt;&lt;/figure&gt;&lt;p&gt;&lt;b&gt;Convolutional Neural Network (CNN) architectures and tradeoffs&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;In recent years, CNNs have become very successful in image classification problems [1] [5] [6]. Since 2012, new CNN architectures have continuously improved accuracy on the standard &lt;a href="http://image-net.org/challenges/LSVRC/2016/" target="_blank"&gt;ImageNet&lt;/a&gt; classification challenge. Some of the major breakthroughs include AlexNet (2012) [6], GoogLeNet [5], VGG (2013) [2] and Residual Networks (2015) [1]. These networks have different tradeoffs in terms of runtime, memory requirements, and accuracy. The main indicators for runtime and memory requirements are:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;Flops or connections – The number of connections in a neural network determines the number of compute operations during a forward pass, which is proportional to the runtime of the network while classifying an image.&lt;/li&gt;
    &lt;li&gt;Parameters – The number of parameters in a neural network determines the amount of memory needed to load the network.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;Ideally we want a network with minimum flops and minimum parameters, which would achieve maximum accuracy.&lt;/p&gt;
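&lt;p&gt;To make the two indicators concrete, here is a minimal sketch that counts parameters and multiply-accumulate operations for a single convolutional layer. The function names, the layer shapes, and the 4-bytes-per-parameter (float32) assumption are illustrative choices, not taken from any of the models discussed here.&lt;/p&gt;

```python
def conv_layer_cost(in_channels, out_channels, kernel_size, out_height, out_width):
    """Return (parameters, multiply_accumulates) for one 2-D convolution layer."""
    weights_per_filter = in_channels * kernel_size * kernel_size
    # Each output channel has one filter plus one bias term.
    parameters = out_channels * (weights_per_filter + 1)
    # Every output position of every output channel applies one full filter.
    multiply_accumulates = out_channels * weights_per_filter * out_height * out_width
    return parameters, multiply_accumulates


def memory_megabytes(parameters, bytes_per_parameter=4):
    """Memory needed to hold the weights, assuming float32 storage by default."""
    return parameters * bytes_per_parameter / (1024 * 1024)


# Hypothetical 3x3 layer on a 14x14 feature map, at full and at half width.
full_params, full_macs = conv_layer_cost(256, 256, 3, 14, 14)
thin_params, thin_macs = conv_layer_cost(128, 128, 3, 14, 14)

print(full_params, full_macs)
print(thin_params, thin_macs)
print(round(memory_megabytes(full_params), 2), "MB vs",
      round(memory_megabytes(thin_params), 2), "MB")
```

&lt;p&gt;In this sketch, halving the number of filters in both the input and the output of a layer cuts its compute by a factor of four and its parameter count by roughly the same factor, which is why narrower networks can be much cheaper to run and to load.&lt;/p&gt;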
&lt;p&gt;&lt;b&gt;Training a deep neural network for NSFW classification&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;We train the models using a dataset of positive (i.e. NSFW) images and negative (i.e. SFW – suitable/safe for work) images. We are not releasing the training images or other details due to the nature of the data, but instead we open source the output model which can be used for classification by a developer.&lt;/p&gt;
&lt;p&gt;We use the &lt;a href="http://caffe.berkeleyvision.org/" target="_blank"&gt;Caffe&lt;/a&gt; deep learning library and &lt;a href="https://github.com/yahoo/CaffeOnSpark" target="_blank"&gt;CaffeOnSpark&lt;/a&gt;; the latter is a powerful open source framework for distributed learning that brings Caffe deep learning to Hadoop and Spark clusters for training models (Big shout out to Yahoo’s CaffeOnSpark team!).&lt;/p&gt;
&lt;p&gt;While training, the images were resized to 256x256 pixels, horizontally flipped for data augmentation, and randomly cropped to 224x224 pixels, and were then fed to the network. For training residual networks, we used scale augmentation as described in the ResNet paper [1], to avoid overfitting. We evaluated various architectures to experiment with tradeoffs of runtime vs accuracy.&lt;/p&gt;
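&lt;p&gt;The flip-and-crop step above can be sketched in a few lines of plain Python. This is only an illustration: it assumes the image has already been resized to 256x256 and is stored as a list of rows, that the flip is applied at random with equal odds, and the helper names are hypothetical rather than part of Caffe or our training code.&lt;/p&gt;

```python
import random


def horizontal_flip(image):
    """Mirror an image stored as a list of rows (each row a list of pixels)."""
    return [row[::-1] for row in image]


def random_crop(image, crop_size, rng):
    """Cut a crop_size x crop_size window at a random position."""
    height = len(image)
    width = len(image[0])
    top = rng.randint(0, height - crop_size)    # randint is inclusive at both ends
    left = rng.randint(0, width - crop_size)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]


def augment(image, rng, crop_size=224):
    """Randomly flip (assumed 50/50), then take a random crop."""
    if rng.choice((True, False)):
        image = horizontal_flip(image)
    return random_crop(image, crop_size, rng)


# Stand-in for an already resized 256x256 image: each pixel records its (row, column).
rng = random.Random(0)
resized = [[(y, x) for x in range(256)] for y in range(256)]
crop = augment(resized, rng)
print(len(crop), len(crop[0]))
```

&lt;p&gt;Because the crop position and the flip change on every pass, the network rarely sees exactly the same pixels twice for a given training image.&lt;/p&gt;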
&lt;ol&gt;&lt;li&gt;MS_CTC [4] – This architecture was proposed in Microsoft’s constrained time cost paper. It improves on AlexNet in speed and accuracy while maintaining a combination of convolutional and fully-connected layers.&lt;/li&gt;
    &lt;li&gt;Squeezenet [3] – This architecture introduces the fire module, which contains layers to squeeze and then expand the input data blob. This reduces the number of parameters while keeping ImageNet accuracy as good as AlexNet’s, and the memory requirement is only 6MB.&lt;/li&gt;
    &lt;li&gt;VGG [2] – This architecture has 13 conv layers and 3 FC layers.&lt;/li&gt;
    &lt;li&gt;GoogLeNet [5] – GoogLeNet introduces inception modules and has 20 convolutional layer stages. It also attaches auxiliary (“hanging”) loss functions to intermediate layers to tackle the problem of diminishing gradients in deep networks.&lt;/li&gt;
    &lt;li&gt;ResNet-50 [1] – ResNets use shortcut connections to solve the problem of diminishing gradients. We used the 50-layer residual network released by the authors.&lt;/li&gt;
    &lt;li&gt;ResNet-50-thin – The model was generated using our &lt;a href="https://github.com/jay-mahadeokar/pynetbuilder" target="_blank"&gt;pynetbuilder&lt;/a&gt; tool and replicates the &lt;a href="https://arxiv.org/pdf/1512.03385v1.pdf" target="_blank"&gt;Residual Network&lt;/a&gt; paper’s 50-layer network (with half the number of filters in each layer). You can find more details on how the model was generated and trained &lt;a href="https://github.com/jay-mahadeokar/pynetbuilder/tree/master/models/imagenet" target="_blank"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;&lt;figure class="tmblr-full" data-orig-height="1673" data-orig-width="2548"&gt;&lt;img src="https://66.media.tumblr.com/c888335d4a41cd4bbaa31140a82d812e/tumblr_inline_oebl14F2zG1rilvr1_540.jpg" data-orig-height="1673" data-orig-width="2548"/&gt;&lt;/figure&gt;&lt;p&gt;&lt;i&gt;Tradeoffs of different architectures: accuracy vs number of flops vs number of params in network.&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;The deep models were first pre-trained on the &lt;a href="http://image-net.org/" target="_blank"&gt;ImageNet&lt;/a&gt; 1000-class dataset. For each network, we replace the last layer (FC1000) with a 2-node fully-connected layer. Then we fine-tune the weights on the NSFW dataset. Note that we set the learning rate multiplier for the new FC layer to 5 times that of the other layers being fine-tuned, so the freshly initialized layer learns faster than the pre-trained ones. We also tune the hyperparameters (step size, base learning rate) to optimize performance.&lt;/p&gt;
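&lt;p&gt;As a rough illustration of how the base learning rate, step size, and per-layer multiplier interact, the sketch below assumes a step-decay schedule of the kind Caffe’s "step" policy describes. The numeric values and the layer names are hypothetical; the actual settings used for training are not given in this post.&lt;/p&gt;

```python
def step_learning_rate(base_lr, gamma, step_size, iteration):
    """Step decay: scale the base rate by gamma once every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)


def layer_learning_rate(global_lr, lr_mult):
    """Each layer trains at the global rate times its own multiplier."""
    return global_lr * lr_mult


# Hypothetical settings, for illustration only.
BASE_LR = 0.001
GAMMA = 0.1
STEP_SIZE = 10000
LR_MULT = {"pretrained_layers": 1.0, "new_fc_2": 5.0}  # new FC layer: 5x the others

for iteration in (0, 10000, 25000):
    global_lr = step_learning_rate(BASE_LR, GAMMA, STEP_SIZE, iteration)
    rates = {name: layer_learning_rate(global_lr, mult)
             for name, mult in LR_MULT.items()}
    print(iteration, rates)
```

&lt;p&gt;Whatever the schedule does to the global rate, the replaced 2-node layer always moves five times faster than the pre-trained layers in this setup.&lt;/p&gt;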
&lt;p&gt;We observe that the performance of the models on NSFW classification tasks is related to the performance of the pre-trained model on ImageNet classification tasks, so if we have a better pretrained model, it helps in fine-tuned classification tasks. The graph below shows the relative performance on our held-out NSFW evaluation set. Please note that the false positive rate (FPR) at a fixed false negative rate (FNR) shown in the graph is specific to our evaluation dataset, and is shown here for illustrative purposes. To use the models for NSFW filtering, we suggest that you plot the ROC curve using your dataset and pick a suitable threshold.&lt;/p&gt;
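&lt;p&gt;One simple way to pick such a threshold is to sweep over the observed scores and keep the highest one that still meets a false negative budget. The sketch below assumes labels of 1 for NSFW and 0 for SFW, and that images scoring at or above the threshold are flagged; the function names and the toy scores are made up for illustration.&lt;/p&gt;

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (FPR, FNR) when images scoring at or above threshold are flagged."""
    flagged = [score >= threshold for score in scores]
    positives = sum(labels)                      # label 1 = NSFW (assumed convention)
    negatives = len(labels) - positives          # label 0 = SFW
    false_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    false_neg = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    return false_pos / negatives, false_neg / positives


def pick_threshold(scores, labels, max_fnr):
    """Highest threshold whose FNR stays within max_fnr (lowest FPR at that budget)."""
    best = None
    for threshold in sorted(set(scores)):
        fpr, fnr = rates_at_threshold(scores, labels, threshold)
        if max_fnr >= fnr:
            best = (threshold, fpr, fnr)
    return best


# Toy evaluation set, invented for this example.
scores = [0.05, 0.10, 0.30, 0.45, 0.60, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

print(pick_threshold(scores, labels, max_fnr=0.25))
print(pick_threshold(scores, labels, max_fnr=0.0))
```

&lt;p&gt;Raising the threshold trades false positives for false negatives, so the right operating point depends on which kind of error is more costly for your application.&lt;/p&gt;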
&lt;figure class="tmblr-full" data-orig-height="1503" data-orig-width="2595"&gt;&lt;img src="https://66.media.tumblr.com/ed86cca7cc8b095299fead8f2afe0934/tumblr_inline_oebl1ggQH01rilvr1_540.jpg" data-orig-height="1503" data-orig-width="2595"/&gt;&lt;/figure&gt;&lt;p&gt;&lt;i&gt;Comparison of performance of models on Imagenet and their counterparts fine-tuned on NSFW dataset.&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;We are releasing the thin ResNet 50 model, since it provides a good tradeoff in terms of accuracy, and the model is lightweight in terms of runtime (takes &amp;lt; 0.5 sec on CPU) and memory (~23 MB). Please refer to our &lt;a href="https://github.com/yahoo/open_nsfw" target="_blank"&gt;git repository&lt;/a&gt; for instructions and usage of our model. We encourage developers to try the model for their NSFW filtering use cases. For any questions or feedback about the model’s performance, we encourage &lt;a href="https://github.com/yahoo/open_nsfw/issues" target="_blank"&gt;creating an issue&lt;/a&gt; and we will respond ASAP.&lt;/p&gt;
&lt;p&gt;Results can be improved by &lt;a href="http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html" target="_blank"&gt;fine-tuning&lt;/a&gt; the model for your dataset or use case. If you achieve improved performance or have trained an NSFW model with a different architecture, we encourage you to contribute to the model or share the link on our &lt;a href="https://github.com/yahoo/open_nsfw/blob/master/README.md" target="_blank"&gt;description&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;Disclaimer: The definition of NSFW is subjective and contextual. This model is a general purpose reference model, which can be used for the preliminary filtering of pornographic images. We do not provide guarantees of accuracy of output, rather we make this available for developers to explore and enhance as an open source project.&lt;/p&gt;
&lt;p&gt;We would like to thank &lt;a href="https://github.com/sachinfarfade/" target="_blank"&gt;Sachin Farfade&lt;/a&gt;, &lt;a href="https://github.com/amar-kamat" target="_blank"&gt;Amar Ramesh Kamat&lt;/a&gt;, &lt;a href="https://github.com/akappeler" target="_blank"&gt;Armin Kappeler&lt;/a&gt;, and Shraddha Advani for their contributions to this work.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;References&lt;/b&gt;:&lt;/p&gt;
&lt;p&gt;[1] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image recognition.” arXiv preprint arXiv:1512.03385 (2015).&lt;/p&gt;
&lt;p&gt;[2] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).&lt;/p&gt;
&lt;p&gt;[3] Iandola, Forrest N., Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 1MB model size.” arXiv preprint arXiv:1602.07360 (2016).&lt;/p&gt;
&lt;p&gt;[4] He, Kaiming, and Jian Sun. “Convolutional neural networks at constrained time cost.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5353-5360. 2015.&lt;/p&gt;
&lt;p&gt;[5] Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going deeper with convolutions.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9. 2015.&lt;/p&gt;
&lt;p&gt;[6] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” In Advances in neural information processing systems, pp. 1097-1105. 2012.&lt;/p&gt;
&lt;/blockquote&gt;</description><link>https://yahooresearch.tumblr.com/post/151345512356</link><guid>https://yahooresearch.tumblr.com/post/151345512356</guid><pubDate>Tue, 04 Oct 2016 11:51:26 -0700</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>Deep Learning</category><category>NSFW</category><category>Engineering</category></item><item><title>A Rising Star in the Research Community</title><description>&lt;figure data-orig-width="245" data-orig-height="323"&gt;&lt;img src="https://66.media.tumblr.com/a79a367dd23a916e0925ac234ef4910d/tumblr_inline_oc4jq1eiRr1rgj0aw_540.png" data-orig-width="245" data-orig-height="323"/&gt;&lt;/figure&gt;&lt;p&gt;At Yahoo Research, we pride ourselves on our scientists’ significant contributions to Yahoo’s platforms and products as well as the external research community. One of our colleagues who exemplifies this balance the most is Senior Research Scientist &lt;a href="https://research.yahoo.com/researchers/bthomee" target="_blank"&gt;Bart Thomee&lt;/a&gt;. That’s why it came as no surprise to us when he recently received a prestigious award for his efforts in the field of multimedia computing. Bart is the latest recipient of the 2016 Rising Star Award presented by the Association for Computing Machinery (ACM) Special Interest Group on Multimedia (SIGMM).&lt;br/&gt;&lt;/p&gt;&lt;p&gt;Bart sits with the Yahoo/Flickr team in San Francisco. 
His work centers on researching, designing and developing next-generation content-based multimedia search, browse, discovery and exploration techniques, with a particular focus on devising new geo technologies that are able to exploit the spatiotemporal nature of georeferenced media.&lt;/p&gt;&lt;p&gt;The &lt;a href="http://sigmm.org/news/sigmm_rising_star_award_2016" target="_blank"&gt;announcement&lt;/a&gt; notes, “The ACM Special Interest Group on Multimedia (SIGMM) is pleased to present this year’s Rising Star Award in multimedia computing, communications and applications to Dr. Bart Thomee for his significant contributions in the areas of geo-multimedia computing, media evaluation and open research datasets.”&lt;/p&gt;&lt;p&gt;The award “recognizes a young researcher who has made outstanding research contributions to the field of multimedia computing, communication and applications during the early part of his or her career.”&lt;/p&gt;&lt;p&gt;We’re very proud of Bart and thank him for all of his hard work.&lt;br/&gt;&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/149144175606</link><guid>https://yahooresearch.tumblr.com/post/149144175606</guid><pubDate>Thu, 18 Aug 2016 14:31:49 -0700</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>ACM</category><category>SIGMM</category><category>Flickr</category><category>MultiMedia</category></item><item><title>Creating Animated GIFs Automatically from Video</title><description>&lt;p&gt;By &lt;a href="https://research.yahoo.com/researchers/yalesong" target="_blank"&gt;Yale Song&lt;/a&gt;, &lt;a href="https://research.yahoo.com/researchers/liangliang" target="_blank"&gt;Liangliang Cao&lt;/a&gt;, and Michael Gygli&lt;b&gt;&lt;br/&gt;&lt;/b&gt;&lt;/p&gt;&lt;p&gt;With the rise in popularity of mobile phones and action cameras (e.g. GoPro cameras), video capture has become cheap and omnipresent. 
Vast amounts of video are recorded every day to capture special moments and log daily activities. At the same time, animated GIFs, first introduced in the late 80s, have gained huge popularity on social networks such as Tumblr and Reddit. This is not a coincidence: the abundance of user-generated video content calls for ways to easily edit videos and make them more accessible. &lt;/p&gt;&lt;p&gt;Toward this goal, various websites – like Tumblr, GIFSoup, Imgflip, and Ezgif – offer easy-to-use tools to manually generate GIFs from portions of a given video. At Yahoo Research, in collaboration with ETH Zurich, we’ve gone one step further by developing a system that &lt;b&gt;automatically&lt;/b&gt; generates animated GIFs from the most &amp;ldquo;GIFable&amp;rdquo; segments of a video. &lt;/p&gt;&lt;p&gt;Try it out for yourself on our demo website: &lt;a href="http://video2gif.info/autogif" target="_blank"&gt;http://video2gif.info/autogif&lt;/a&gt;. Simply provide a YouTube link and our system will automatically select the best parts of the video and convert them into GIFs.&lt;br/&gt;&lt;/p&gt;&lt;figure class="tmblr-full" data-orig-height="342" data-orig-width="800"&gt;&lt;img src="https://66.media.tumblr.com/20f0809c98ef5f5a22bbc62f54c09575/tumblr_inline_oaxgiyDDxi1rgj0aw_540.png" data-orig-height="342" data-orig-width="800" alt="image"/&gt;&lt;/figure&gt;&lt;p&gt;Automatically determining which portions of a video are the most “GIFable” is a challenging research problem. By GIFable, we mean a segment of a video that our algorithm chooses as the most likely candidate for a GIF based on what it has learned from what we believe to be the largest dataset in the related field. 
We detail our method and our dataset in our recently published research paper, “&lt;a href="https://research.yahoo.com/publications/8723/video2gif-automatic-generation-animated-gifs-video" target="_blank"&gt;Video2GIF: Automatic Generation of Animated GIFs from Video&lt;/a&gt;,” that appeared in the proceedings of the 29th &lt;a href="http://cvpr2016.thecvf.com/" target="_blank"&gt;IEEE Conference on Computer Vision and Pattern Recognition&lt;/a&gt; (CVPR 2016).&lt;br/&gt;&lt;/p&gt;&lt;p&gt;To explain the main points of our research work, let us briefly describe how we’ve built our system. &lt;/p&gt;&lt;p&gt;A naive approach would be to define a list of GIFable moments (e.g., “cat runs into a box”) and write an algorithm that detects those moments from a video. This approach might work well for a limited set of moments. As the list grows though, defining and maintaining such a list would certainly not scale – think of the variety of video content on the Internet! At Yahoo Research, scalability is at the heart of our scientific endeavors; we needed a better solution that could handle the variety of video content.&lt;br/&gt;&lt;/p&gt;&lt;center&gt;&lt;img class="ylabs-img-main" src="https://s.yimg.com/ge/research/GIF2.gif" width="180" height="125" alt="image"/&gt;&lt;img class="ylabs-img-main" src="https://s.yimg.com/ge/research/GIF3.gif" width="180" height="125" alt="image"/&gt;&lt;img class="ylabs-img-main" src="https://s.yimg.com/ge/research/GIF4.gif" width="180" height="125" alt="image"/&gt;&lt;/center&gt;&lt;center&gt;&lt;address dir="ltr"&gt;Animated GIFs cover a large variety of content, which makes automatically detecting them challenging.&lt;/address&gt;&lt;/center&gt;&lt;br/&gt;&lt;p&gt;To this end, we turned to an approach that learns from the vast amount of animated GIFs available on the Internet. 
We specifically focused on only those GIFs that have links to their source videos, such as the one from &lt;a href="http://gifsboom.net/post/144906884177/desert-iguana-video" target="_blank"&gt;this Tumblr post&lt;/a&gt;. This allowed us to collect what the machine learning community calls a “weakly-labeled” dataset – the fact that some portions of a video are selected as animated GIFs gives us a “weak” signal that those portions are more GIFable than other portions from the same video.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;To date, our dataset has more than 100,000 pairs of animated GIFs and their corresponding videos. We recently made our dataset &lt;a href="https://github.com/gyglim/video2gif_dataset" target="_blank"&gt;publicly available&lt;/a&gt; to the academic community. To the best of our knowledge, this is the largest dataset available in the related fields of video highlight detection and summarization.&lt;/p&gt;&lt;p&gt;On the technology front, we developed our system using advanced techniques from computer vision and deep learning; the most important part is a &lt;a href="https://en.wikipedia.org/wiki/Convolutional_neural_network" target="_blank"&gt;convolutional neural network&lt;/a&gt;, a class of models composed of multiple layers where each layer extracts increasingly high-level information from the previous layer. These networks can be trained in an end-to-end fashion with labeled examples: the network takes as input an image or a short video segment, reads it in the form of pixel values, then successively transforms the information into a semantic understanding of what is shown at the highest layer, so that it produces an output value that is similar to the given label (for illustration, see page 5 in &lt;a href="https://deeplearningworkshopnips2010.files.wordpress.com/2010/09/nips10-workshop-tutorial-final.pdf" target="_blank"&gt;these NIPS 2010 tutorial slides&lt;/a&gt;). 
&lt;/p&gt;&lt;p&gt;The core innovation in our approach is the within-video &lt;a href="https://en.wikipedia.org/wiki/Learning_to_rank" target="_blank"&gt;ranking formulation&lt;/a&gt; of the problem. In short, we constructed a convolutional neural network and trained it in such a way that the network outputs a higher score for a GIF segment than for a non-GIF segment, where both segments come from the same video. The rationale behind this is as follows: the fact that someone selected a segment to be an animated GIF doesn’t necessarily mean that other segments are clearly bad – it merely suggests that the selected segment is comparably better (in whatever sense) than other segments of the same video. Also, comparing segments is meaningful only within the context of a video – we cannot treat GIF segments from, say, a cat compilation video and a baseball game video the same, because each video has a different notion of “interestingness.” &lt;/p&gt;&lt;p&gt;Using our dataset, we trained our neural network using over 500,000 pairs of GIF/non-GIF segments, each coming from the same video, so that it learned to assign a higher score to a GIFable segment. Thus, after having learned such a model, we can now take a new video, split it into parts, and present each part to the model. The model will then provide a score telling us how suitable the part is as an animated GIF (in other words, how GIFable it is). Finally, we select a few top-scoring parts and automatically convert them into GIFs (this is exactly what our Web demo does). &lt;/p&gt;&lt;p&gt;Try out our &lt;a href="http://video2gif.info/autogif" target="_blank"&gt;demo&lt;/a&gt; showcasing our technique and give us any feedback you have. 
Happy GIF-ing!&lt;br/&gt;&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/148009705216</link><guid>https://yahooresearch.tumblr.com/post/148009705216</guid><pubDate>Tue, 26 Jul 2016 12:19:06 -0700</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>GIF</category><category>GIFs</category><category>video</category><category>machine learning</category><category>computer vision</category><category>IEEE</category><category>demo</category></item><item><title>Open Sourcing SparkADMM: 
a Massively-parallel Framework for Solving Big Data Problems</title><description>&lt;a href="https://s.yimg.com/ge/research/SparkADMM.png" target="_blank"&gt;&lt;/a&gt;&lt;a href="https://s.yimg.com/ge/research/SparkADMM.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="370" data-orig-width="1372" data-orig-src="https://s.yimg.com/ge/research/SparkADMM.png"&gt;&lt;img src="https://66.media.tumblr.com/9e1c95bf26a85ba5d85458108c49a38a/tumblr_inline_p8wvh4Ua1h1rgj0aw_540.png" alt="image" data-orig-height="370" data-orig-width="1372" data-orig-src="https://s.yimg.com/ge/research/SparkADMM.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;By &lt;a href="https://research.yahoo.com/researchers/nlaptev" target="_blank"&gt;Nikolay Laptev&lt;/a&gt; and Stratis Ioannidis&lt;b&gt;&lt;br/&gt;&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Training machine learning models over massive amounts of data is a cornerstone of many data analytics tasks. Usually this means solving large optimization problems with millions of variables and constraints. Doing so over a parallel platform, like &lt;a href="http://spark.apache.org/" target="_blank"&gt;Spark&lt;/a&gt; or &lt;a href="http://hadoop.apache.org/" target="_blank"&gt;Hadoop&lt;/a&gt;, is crucial to making such computations scalable. &lt;/p&gt;&lt;p&gt;It is not always obvious how to solve large optimization problems in parallel. &lt;a href="http://stanford.edu/~boyd/admm.html" target="_blank"&gt;ADMM&lt;/a&gt;, which stands for the Alternating Direction Method of Multipliers, is a popular parallel optimization technique that provides a methodology for doing so. It permits the massive parallelization of a broad array of important machine learning tasks, such as regression and classification. For example, to train a classifier using ADMM over a very large dataset, a developer first partitions the dataset across multiple machines. 
A classifier is trained on each machine, based on the locally-stored portion of the dataset. Then, a global classifier learned from the entire dataset is extracted through &lt;i&gt;consensus&lt;/i&gt;; ADMM averages out these classifiers and repeats the process through several iterations, forcing the local computations to be closer to the consensus value each time. This way, after several iterations, ADMM constructs a “consensus” classifier, which provably fits the entire dataset.&lt;/p&gt;&lt;p&gt;ADMM’s strength lies in its generality: it gives a template on how to take any serial machine learning algorithm designed to operate locally on a single dataset, and parallelize its execution over thousands of machines. To that end, Yahoo Research is publishing &lt;a href="https://github.com/yahoo/SparkADMM" target="_blank"&gt;SparkADMM&lt;/a&gt; – a massively parallel abstract programming framework for solving big data optimization problems through ADMM over Spark – to the Open Source community. The implementation, developed by Yahoo researchers Stratis Ioannidis, Yunjiang Jiang, Nikolay Laptev, and Saeed Amizadeh, allows for the quick deployment of ADMM solvers over Spark, without any prior knowledge about ADMM for consensus.&lt;/p&gt;&lt;p&gt;The released code provides an extensible interface through which developers in the community can deploy and execute ADMM on arbitrary optimization problems of their choice over massive amounts of data. It allows developers to easily scale and parallelize any model training method they have devised and optimized to run over a single processor. To do so, a developer need only specify how to train a model over local data. SparkADMM takes care of executing the code in parallel over Spark, and reaching a consensus value. 
This makes the code modular and allows a developer to quickly scale serial code written for small datasets to datasets of several terabytes in size.&lt;/p&gt;&lt;p&gt;Yahoo researchers have demonstrated the flexibility of this code by training traffic forecasting models through SparkADMM. Their work, entitled “&lt;a href="https://research.yahoo.com/publications/8772/parallel-news-article-traffic-forecasting-admm" target="_blank"&gt;Parallel News-Article Traffic Forecasting with ADMM&lt;/a&gt;,” will appear in the proceedings of the &lt;a href="http://www-bcf.usc.edu/~liu32/milets16/" target="_blank"&gt;2nd International Workshop on Mining and Learning from Time Series&lt;/a&gt; (MiLeTS), held jointly with the &lt;a href="http://www.kdd.org/kdd2016/" target="_blank"&gt;ACM Conference on Knowledge Discovery and Data Mining&lt;/a&gt; (KDD), August 13-17 in San Francisco, CA. The release comes with a definition of the abstract class developers need to instantiate in order to train their models locally, as well as several example implementations. 
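&lt;/p&gt;&lt;p&gt;To make the consensus loop described earlier concrete, here is a minimal single-process sketch in Python. It is not the SparkADMM API: the function name consensus_admm_mean, the toy objective (each partition fits one number to its local data by least squares), and the defaults for rho and the iteration count are all assumptions chosen so that the local step has a closed form.&lt;/p&gt;

```python
from typing import List


def consensus_admm_mean(
    partitions: List[List[float]], rho: float = 1.0, iterations: int = 200
) -> float:
    """Consensus ADMM (scaled form) for f_i(x) = 0.5 * sum_j (x - a_ij)^2.

    Each inner list plays the role of the data stored on one machine.
    """
    z = 0.0                          # global consensus value
    u = [0.0] * len(partitions)      # one scaled dual variable per partition
    for _ in range(iterations):
        # Local step: each partition solves its own small problem in closed form.
        x = [
            (sum(data) + rho * (z - u_i)) / (len(data) + rho)
            for data, u_i in zip(partitions, u)
        ]
        # Consensus step: average the local solutions (plus their duals).
        z = sum(x_i + u_i for x_i, u_i in zip(x, u)) / len(partitions)
        # Dual step: accumulate each partition's disagreement with the consensus.
        u = [u_i + x_i - z for x_i, u_i in zip(x, u)]
    return z
```

&lt;p&gt;On this toy problem the consensus value approaches the mean of all data points, which is what a single machine holding the entire dataset would compute. In a real deployment only the local step would change, which is the part a developer is asked to supply.&lt;/p&gt;&lt;p&gt;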
Making this available to the community aims to give everyone the opportunity to grow the library of serial tasks that are massively parallelized through ADMM.&lt;br/&gt;&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/147013834176</link><guid>https://yahooresearch.tumblr.com/post/147013834176</guid><pubDate>Wed, 06 Jul 2016 15:16:12 -0700</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>Hadoop</category><category>Spark</category><category>ADMM</category><category>open source</category><category>big data</category><category>machine learning</category><category>optimization</category><category>github</category></item><item><title>Science Powering Product: Large-scale Query-to-Ad Matching in Sponsored Search</title><description>&lt;p&gt;By &lt;a href="https://labs.yahoo.com/researchers/mihajlo" target="_blank"&gt;Mihajlo Grbovic&lt;/a&gt;, Vladan Radosavljevic, Nemanja Djuric, Andy Feng, Erik Ordentlich, and Lee Yang&lt;b&gt;&lt;br/&gt;&lt;/b&gt;&lt;/p&gt;&lt;p&gt;Sponsored search represents a major source of revenue for search engines on the Web. This popular advertising model brings a unique possibility for advertisers to target users’ immediate intent communicated through a search query. Usually this is done by displaying ads alongside organic search results for queries deemed relevant to an advertiser’s products or services. In a typical ad-booking scenario, the advertiser provides the search engine with their ad title and description, along with a list of bid terms (i.e., queries against which they want their ads to run). However, due to a large number of unique queries it is challenging for advertisers to identify all relevant bid terms. For this reason, search engines often provide a service of advanced matching, which automatically finds additional relevant queries for advertisers to bid on. 
&lt;/p&gt;&lt;p&gt;In a new research paper entitled “&lt;a href="https://research.yahoo.com/publications/8758/scalable-semantic-matching-queries-ads-sponsored-search-advertising" target="_blank"&gt;Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising&lt;/a&gt;,” published in the proceedings of the upcoming &lt;a href="http://sigir.org/sigir2016" target="_blank"&gt;39th International ACM SIGIR Conference&lt;/a&gt;, we present a novel advanced matching approach based on the idea of &lt;a href="http://arxiv.org/abs/1310.4546" target="_blank"&gt;semantic embeddings&lt;/a&gt; that was recently launched at full scale on Yahoo’s Sponsored Search platform.&lt;/p&gt;&lt;p&gt;Traditionally, matches between queries and ads are found based on the level of textual similarity between the query text and the ad title text. In our approach, instead of forming the so-called bag-of-words vector representations of queries and ads, we propose learning vector representations using a large dataset of search sessions, such that ads and relevant queries would be close in the vector space. This makes it easy for a computer to match queries to ads based on similarities between the learned vectors.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Large-scale Training Algorithm&lt;/b&gt;&lt;/p&gt;&lt;p&gt;To train the query and ad vectors we used the largest in-house Yahoo Search dataset thus far, comprising over &lt;b&gt;9 billion search sessions&lt;/b&gt;. 
As illustrated in Figure 1 (left), a search session is defined as an uninterrupted sequence of user actions comprising queries (marked in blue), ad clicks (marked in orange), and search link clicks (marked in green).&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/search2vec/Figure11.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/search2vec/Figure11.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="816" data-orig-width="2069" data-orig-src="https://s.yimg.com/ge/research/search2vec/Figure11.png"&gt;&lt;img src="https://66.media.tumblr.com/9066f61ae9df631c7907544b2bc59f38/tumblr_inline_p8wvh5jIMt1rgj0aw_540.png" alt="image" data-orig-height="816" data-orig-width="2069" data-orig-src="https://s.yimg.com/ge/research/search2vec/Figure11.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;Once a search session dataset is created, we then learn a vector representation for each unique query, ad id, and search link click id by leveraging surrounding context. Specifically, queries, ads, and links from the same session are used as positive signals, and ads that are skipped in favor of a click on a lower-positioned ad are seen as negative signals. When updating the vectors we also found it useful to utilize dwell-time, i.e. time spent on the landing page post click, such that we can distinguish between good clicks and unsatisfactory or accidental clicks. The graphical representation of our just-described &lt;b&gt;search2vec&lt;/b&gt; model is illustrated in Figure 1 (right).&lt;b&gt;&lt;br/&gt;&lt;/b&gt;&lt;/p&gt;&lt;p&gt;To realize the full potential of the proposed approach, we found it necessary to train vectors for several hundred million queries, ads, and links. Existing implementations for training embeddings were found to fall short for our vocabulary size target, as they require that all of the vectors fit in the memory of a single machine. 
To address this issue, we developed a novel distributed embedding training algorithm (Figure 2) based on the &lt;a href="https://www.cs.cmu.edu/~muli/file/ps.pdf" target="_blank"&gt;parameter server paradigm&lt;/a&gt;.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/search2vec/Figure12.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/search2vec/Figure12.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="1015" data-orig-width="2078" data-orig-src="https://s.yimg.com/ge/research/search2vec/Figure12.png"&gt;&lt;img src="https://66.media.tumblr.com/fd1bb7f47e2a6e87fedf0da9bfbed483/tumblr_inline_p8wvh63GKo1rgj0aw_540.png" alt="image" data-orig-height="1015" data-orig-width="2078" data-orig-src="https://s.yimg.com/ge/research/search2vec/Figure12.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;Our key innovations and findings were that the column-wise partitioning of vectors among parameter server (PS) shards and server-side computation of vector-dot products greatly reduces network bandwidth requirements relative to the conventional parameter server approach.&lt;b&gt;&lt;br/&gt;&lt;/b&gt;&lt;/p&gt;&lt;p&gt;We implemented our search2vec system in Java and Scala on a &lt;a href="http://hadoop.apache.org" target="_blank"&gt;Hadoop YARN-scheduled cluster&lt;/a&gt;, leveraging &lt;a href="https://slider.incubator.apache.org" target="_blank"&gt;Slider&lt;/a&gt; and &lt;a href="https://spark.apache.org" target="_blank"&gt;Spark&lt;/a&gt;. 
The system allowed us to train embeddings for more than &lt;b&gt;126 million unique queries, 43 million unique ads, and 132 million unique links&lt;/b&gt;, which is a &lt;b&gt;5x increase&lt;/b&gt; compared to a single machine implementation.&lt;/p&gt;&lt;p&gt;&lt;b&gt;Finding the Best Queries for an Ad&lt;/b&gt;&lt;/p&gt;&lt;p&gt;After learning the query and ad vectors, finding the best queries for a specific ad becomes a matter of calculating a cosine similarity between the ad vector and all query vectors, and identifying queries with the highest cosine similarity value. As demonstrated in Figure 3, this allows us to find relevant queries for any ad without ever looking into an ad title and description.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/search2vec/Figure13.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/search2vec/Figure13.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="899" data-orig-width="2092" data-orig-src="https://s.yimg.com/ge/research/search2vec/Figure13.png"&gt;&lt;img src="https://66.media.tumblr.com/2e61af4fdb01b1415623d81113c29a36/tumblr_inline_p8wvh63Amg1rgj0aw_540.png" alt="image" data-orig-height="899" data-orig-width="2092" data-orig-src="https://s.yimg.com/ge/research/search2vec/Figure13.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;To further illustrate the quality of our query vectors, in the demonstration video below we show how resulting query vectors can be manipulated to find similar queries to an input query. 
This process is known as query rewriting.&lt;br/&gt;&lt;/p&gt;&lt;br/&gt;&lt;center&gt;&lt;iframe width="640" height="360" scrolling="no" frameborder="0" src="https://external.global.media.yahoo.com/video/search2vec-query-rewriting-application-125806768.html?format=embed" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;&lt;/center&gt;&lt;br/&gt;&lt;b&gt;Experiments and Product Impact&lt;br/&gt;&lt;/b&gt;&lt;br/&gt;The launch of our new matching algorithm was a yearlong effort by a team of research scientists and engineers. In order to make sure the algorithm would function correctly, we conducted a series of offline and online tests. First, we asked humans to eyeball some matches between the queries and ads, and at the same time, used search2vec to calculate the cosine similarity between their corresponding vectors. Ideally, we wanted the matching algorithm to agree with human judgments, assigning low similarity to matches that humans judged as bad and high similarity to the ones they judged as good. 
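&lt;br/&gt;&lt;br/&gt;The cosine-similarity matching used here, and in the previous section for finding the best queries for an ad, can be sketched as follows. This is a brute-force illustration in plain Python, not the production system: the function names and the dictionary of query vectors are hypothetical.&lt;br/&gt;

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_queries_for_ad(
    ad_vector: Sequence[float],
    query_vectors: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every query against the ad vector and keep the k most similar."""
    scored = [(q, cosine_similarity(ad_vector, v)) for q, v in query_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Scanning every query like this is only practical for small vocabularies; it is shown purely to illustrate the scoring.&lt;br/&gt;&lt;br/&gt;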
Moreover, given a query we wanted our model to rank the ads with good grades higher than the ads with bad grades, which we evaluated using an &lt;a href="https://en.wikipedia.org/wiki/Discounted_cumulative_gain" target="_blank"&gt;NDCG measure&lt;/a&gt;.&lt;br/&gt;&lt;a href="https://s.yimg.com/ge/research/search2vec/Figure14.png" target="_blank"&gt;&lt;/a&gt;&lt;a href="https://s.yimg.com/ge/research/search2vec/Figure14.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="992" data-orig-width="2076" data-orig-src="https://s.yimg.com/ge/research/search2vec/Figure14.png"&gt;&lt;img src="https://66.media.tumblr.com/a27aec4c64c7ad1fc6c1c9bd54f221a5/tumblr_inline_p8wvh7IYP91rgj0aw_540.png" alt="image" data-orig-height="992" data-orig-width="2076" data-orig-src="https://s.yimg.com/ge/research/search2vec/Figure14.png"/&gt;&lt;/figure&gt;&lt;/a&gt;The results summarized in Figure 4 (left) show that, with the exception of several outliers, the search2vec algorithm does a great job of distinguishing between different classes of human judgments. In addition, as Figure 4 (right) shows, search2vec does a better job of ranking ads than several baseline algorithms.&lt;b&gt;&lt;br/&gt;&lt;/b&gt;&lt;br/&gt;Second, before launching our new matching algorithm at full scale, we performed online performance experiments in the form of A/B tests. The first test was conducted one year ago and involved testing a dictionary of query-ad matches produced by a single-machine search2vec model that could scale up to 60 million vectors. When compared to the production model that did not contain this additional dictionary, &lt;b&gt;we observed a 7% increase in revenue per search&lt;/b&gt;. This led to the successful deployment of our first search2vec model.&lt;br/&gt;&lt;br/&gt;In the months that followed, we worked on scaling up the size of our vocabulary by implementing parameter server-based distributed training. It allowed us to train more than 300 million vectors. 
Our second A/B test showed that our distributed search2vec model could &lt;b&gt;achieve an additional 9.4% increase in revenue per search&lt;/b&gt; compared to the production model that already had the single machine dictionary.&lt;br/&gt;&lt;br/&gt;The results from both A/B tests (Table 1) also showed that as we increased query coverage and auction depth, we did not hurt user experience since click-through-rates (CTR) remained flat and even slightly positive.&lt;br/&gt;&lt;a href="https://s.yimg.com/ge/research/search2vec/Table11.png" target="_blank"&gt;&lt;/a&gt;&lt;a href="https://s.yimg.com/ge/research/search2vec/Table11.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="683" data-orig-width="2085" data-orig-src="https://s.yimg.com/ge/research/search2vec/Table11.png"&gt;&lt;img src="https://66.media.tumblr.com/dc79582204709750069e7a35a6144340/tumblr_inline_p8wvh8jlU61rgj0aw_540.png" alt="image" data-orig-height="683" data-orig-width="2085" data-orig-src="https://s.yimg.com/ge/research/search2vec/Table11.png"/&gt;&lt;/figure&gt;&lt;/a&gt;Following successful A/B tests, our distributed training model was launched in production. Today, search2vec is being re-trained on a regular basis and accounts for &lt;b&gt;more than 30% of all broad match impressions and revenue&lt;/b&gt; on Yahoo Search.&lt;b&gt;&lt;br/&gt;&lt;/b&gt;&lt;br/&gt;&lt;b&gt;Contributing Query Embeddings to the Research Community&lt;/b&gt;&lt;br/&gt;&lt;br/&gt;&lt;p&gt;As part of this research, we took &lt;a href="http://webscope.sandbox.yahoo.com/catalog.php?datatype=l&amp;amp;did=73" target="_blank"&gt;8M query vectors&lt;/a&gt; trained using our search2vec system, and made them available to researchers via our Yahoo &lt;a href="http://webscope.sandbox.yahoo.com/" target="_blank"&gt;Webscope&lt;/a&gt; data-sharing program. The vectors may serve as a testbed for query rewriting tasks, or word and sentence similarity tasks, which are common problems in NLP research. 
We would like for researchers to be able to produce query rewrites based on these vectors and test them against other state-of-the-art techniques. In addition, we provide an editorially-judged set of 4016 query rewrites, on which we compared our search2vec performance against the &lt;a href="http://arxiv.org/abs/1310.4546" target="_blank"&gt;word2vec model&lt;/a&gt;, the results of which are summarized below.&lt;br/&gt;&lt;a href="https://s.yimg.com/ge/research/search2vec/Table12.png" target="_blank"&gt;&lt;/a&gt;&lt;a href="https://s.yimg.com/ge/research/search2vec/Table12.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="569" data-orig-width="2087" data-orig-src="https://s.yimg.com/ge/research/search2vec/Table12.png"&gt;&lt;img src="https://66.media.tumblr.com/af64606c0bf47d4eb558954fd8bdf402/tumblr_inline_p8wvh8AxGZ1rgj0aw_540.png" alt="image" data-orig-height="569" data-orig-width="2087" data-orig-src="https://s.yimg.com/ge/research/search2vec/Table12.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/146257394201</link><guid>https://yahooresearch.tumblr.com/post/146257394201</guid><pubDate>Tue, 21 Jun 2016 07:14:43 -0700</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>Machine Learning</category><category>advertising</category><category>Advertising Sciences</category><category>Search</category><category>Sponsored Search</category><category>sigir2016</category><category>research</category><category>science</category></item><item><title>Novel Modeling of Syntactic Parsing for Web Queries</title><description>&lt;a data-flickr-embed="true" data-header="true" data-footer="true" href="https://www.flickr.com/photos/yahooresearch/27395889220/in/datetaken-public/" title="Research Engineer Yuval Pinter presenting at NAACL2016" target="_blank"&gt;&lt;img src="https://c5.staticflickr.com/8/7073/27395889220_fdec32ec0e_z.jpg" width="640" height="360" alt="Research 
Engineer Yuval Pinter presenting at NAACL2016"/&gt;&lt;/a&gt;&lt;script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"&gt;&lt;/script&gt;&lt;p&gt;&lt;br/&gt;By &lt;a href="https://research.yahoo.com/researchers/yuvalp" target="_blank"&gt;Yuval Pinter&lt;/a&gt;, &lt;a href="http://ie.technion.ac.il/~roiri/" target="_blank"&gt;Roi Reichart&lt;/a&gt;, and &lt;a href="https://research.yahoo.com/researchers/idan" target="_blank"&gt;Idan Szpektor&lt;/a&gt;&lt;/p&gt;&lt;p&gt;Most of the content on the Web is in the form of natural language text, such as news articles, blogs, tweets, and queries to search engines. In order to improve the ability of computers to perform tasks related to these texts, including translation, summarization, recommendation, and search, researchers develop algorithms, models, and tools executed automatically by computers for Natural Language Processing (NLP). Yet, current state-of-the-art NLP models are still not fit to process the variety of text styles that appear across the Web. 
In an effort to expand the utility of NLP tools for Web content, in this blog post we discuss novel modeling of syntactic parsing for Web queries.&lt;/p&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-01.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-01.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="670" data-orig-width="1774" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-01.png"&gt;&lt;img src="https://66.media.tumblr.com/28af819bb89cf6a92b6b53b9ff0285e2/tumblr_inline_p96f84kHl81rgj0aw_540.png" alt="image" data-orig-height="670" data-orig-width="1774" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-01.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;One important building block of automatic text processing for NLP is syntactic parsers: automatic systems for discovering the syntactic connections between words in texts. An example of the output of syntactic parsing is shown above for the sentence “the cat ate food:” &lt;i&gt;cat&lt;/i&gt; is the &lt;u&gt;subject&lt;/u&gt; of the verb ate, &lt;i&gt;food&lt;/i&gt; is the &lt;u&gt;direct object&lt;/u&gt;, and &lt;i&gt;the&lt;/i&gt; serves as a &lt;u&gt;determiner&lt;/u&gt; for the word &lt;i&gt;cat&lt;/i&gt;. These relations form the syntactic &lt;b&gt;parse tree&lt;/b&gt; for the given sentence. Syntactic parsers are usually trained to handle grammatical English sentences, such as those written in editorially-supervised news articles. Therefore, the performance of off-the-shelf syntactic parsers drops when applied to texts that are not “well written.” Yet, most of the texts encountered on the Web are generated not by professional authors, but rather by ordinary users. A very common type of these texts is Web search queries issued to search engines. 
In theory, syntactic parsers can be applied to Web search queries to aid in determining the best search results based on the relationship between words in the query. They can also find entities in the query, help a user reformulate a query, and more. However, search queries often do not follow the grammatical conventions of standard English, and many resemble keyword-keyphrase bundles, e.g. “weather chicago,” “best hotel where to find new york,” and “why is the sky blue yahoo answers.”&lt;/p&gt;&lt;p&gt;When faced with the challenge of parsing a given Web query, two problems need to be addressed:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;What is the correct parse structure for this query?&lt;/li&gt;&lt;li&gt;How can an automatic parser learn to output this structure?&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;In a research &lt;a href="https://research.yahoo.com/publications/8709/syntactic-parsing-web-queries-question-intent" target="_blank"&gt;paper&lt;/a&gt; published in the proceedings of this week’s &lt;a href="http://naacl.org/naacl-hlt-2016/" target="_blank"&gt;15th Annual Conference&lt;/a&gt; of the&lt;a href="http://www.naacl.org/" target="_blank"&gt; North American Chapter&lt;/a&gt; of the&lt;a href="http://www.aclweb.org/" target="_blank"&gt; Association for Computational Linguistics&lt;/a&gt;: Human Language Technologies (NAACL-HLT), we propose solutions for both these problems. First, in answer to problem (1), we define a grammar structure that suits these keyphrase-bundles based on the notion of a &lt;b&gt;syntactic segment&lt;/b&gt;: a word sequence that can be represented using a single parse tree, but is not syntactically connected to the other parts of the query. In our proposed grammar structure, each query is split into syntactic segments, and each segment is represented by a (fully connected) parse tree. 
An example:&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-02.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-02.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="596" data-orig-width="1434" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-02.png"&gt;&lt;img src="https://66.media.tumblr.com/b3d548941599f5f5678d8231f563c81e/tumblr_inline_p96f856zVN1rgj0aw_540.png" alt="image" data-orig-height="596" data-orig-width="1434" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-02.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;In addition to defining this grammar structure, we also released a 5,000-&lt;b&gt;query treebank&lt;/b&gt;: a dataset which we mined from Web queries landing on Yahoo Answers and manually tagged for segmentation and parsing for the benefit of the research community (via the &lt;a href="https://webscope.sandbox.yahoo.com/" target="_blank"&gt;Yahoo Webscope&lt;/a&gt; program).&lt;br/&gt;&lt;/p&gt;&lt;p&gt;After running a number of state-of-the-art parsers on this data, we verified that systems trained for grammatical English perform poorly. 
For example, in the query “invent toy school project,” which corresponds to the need “invent a toy for a school project,” an out-of-the-box parser thinks the word “toy” modifies the word “project,” when in fact it does not – they are parts of different segments!&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-03.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-03.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="842" data-orig-width="1168" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-03.png"&gt;&lt;img src="https://66.media.tumblr.com/a3be6b17fc5f4d3446a29d52e0b8fe5c/tumblr_inline_p96f851sG71rgj0aw_540.png" alt="image" data-orig-height="842" data-orig-width="1168" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-03.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;Next, we turn to problem (2): how to train a new system that can handle our segmentation-enriched grammar structure. We propose three approaches that we implemented and compared in our paper. First, we designed a two-step procedure that learns the segmentation of a query and then applies an off-the-shelf parser to each segment found in the first step. For segmentation, a standard approach is to use &lt;b&gt;supervised learning&lt;/b&gt;, where a &lt;a href="https://en.wikipedia.org/wiki/Conditional_random_field" target="_blank"&gt;CRF&lt;/a&gt; system for detecting segments is trained on the queries in our treebank.&lt;b&gt;&lt;br/&gt;&lt;/b&gt;&lt;/p&gt;&lt;p&gt;We next proposed two methods based on the concept of &lt;b&gt;distant supervision&lt;/b&gt;, where the training examples are not manually-tagged queries, but automatically derived. The advantage of distant supervision over supervised learning is that no long, arduous, and sometimes error-prone human labor is required to train a machine learning algorithm. 
Specifically, we utilized millions of queries for which the searchers clicked on pages in Yahoo Answers. The question in the clicked Yahoo Answers page, which is typically a well-formed sentence, expresses the same request as the corresponding query, and often contains many of its words. We call a pair consisting of a query and a corresponding clicked question, an &lt;b&gt;aligned pair&lt;/b&gt;.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-04.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-04.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="1112" data-orig-width="1386" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-04.png"&gt;&lt;img src="https://66.media.tumblr.com/c50b228f1b6c34d794f5cf6fd7383e3f/tumblr_inline_p96f86Gt2l1rgj0aw_540.png" alt="image" data-orig-height="1112" data-orig-width="1386" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-04.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;Knowing this, we defined several rules for automatically detecting segment boundaries in queries based on their aligned questions. Examples for these rules include word reordering (i.e. 
words that switch positions between the query and question) and connectives (such as forms of the verb be) that appear in the question but are missing from the query.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-05.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-05.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="324" data-orig-width="1672" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-05.png"&gt;&lt;img src="https://66.media.tumblr.com/bf5e9a2e0375a7a1b4f6e35722669e87/tumblr_inline_p96f876ig11rgj0aw_540.png" alt="image" data-orig-height="324" data-orig-width="1672" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-05.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;br/&gt;&lt;p&gt;Given the automatically-segmented queries, we train a CRF system similar to the supervised approach above.&lt;b&gt;&lt;br/&gt;&lt;/b&gt;&lt;/p&gt;&lt;p&gt;In our second distant supervision approach, we apply a query-to-question (Q2Q) algorithm, which converts an input query into a full question. This algorithm utilized the aligned pairs between queries and clicked questions to learn templates that perform this conversion (see details in: &lt;a href="http://www2013.org/proceedings/p391.pdf" target="_blank"&gt;Dror et al. 2013. 
From query to question in one click: suggesting synthetic questions to searchers&lt;/a&gt;).&lt;br/&gt;&lt;/p&gt;&lt;p&gt;Given a new query, our parsing algorithm:&lt;/p&gt;&lt;ol type="i"&gt;&lt;li&gt;converts the query into a complete question using the Q2Q algorithm&lt;/li&gt;&lt;li&gt;parses the generated question using an off-the-shelf parser&lt;/li&gt;&lt;li&gt;projects the parse tree of the question onto the query, detecting segments based on words along the question’s parse tree that are missing from the query&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-06.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-06.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="742" data-orig-width="1628" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-06.png"&gt;&lt;img src="https://66.media.tumblr.com/feea9a453afd967e777c06702ab066f5/tumblr_inline_p96f87syZy1rgj0aw_540.png" alt="image" data-orig-height="742" data-orig-width="1628" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-06.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;br/&gt;&lt;p&gt;We compared the performance of the three approaches (as well as a no-segmentation baseline) against the query treebank. When it comes to the segmentation task on its own (which we measure in F1 score), we see that for the entire set of queries, both CRF approaches perform similarly. When we start looking into interesting subsets, differences start to emerge: queries which our off-the-shelf parser deemed difficult to parse by reporting a low confidence score were segmented better by our distant-supervised system. This is an important distinction, because low-confidence queries can be detected in a production system so it can choose which segmentation algorithm to use online. Two other subsets we looked at were queries with just one segment, and those with multiple segments. 
The supervised system performed better on the former, but the distant supervision did extremely well on the latter, highlighting its improved ability to detect correct segmentation locations (as opposed to merely avoiding over-detection).&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-07.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-07.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="1038" data-orig-width="1488" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-07.png"&gt;&lt;img src="https://66.media.tumblr.com/1faae7fdf4f11f9caeda63b1aa829895/tumblr_inline_p96f89d9Mw1rgj0aw_540.png" alt="image" data-orig-height="1038" data-orig-width="1488" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-07.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;br/&gt;&lt;p&gt;When we look at the end-to-end task of parsing, not all of these findings hold. For example, Q2Q-based projection improves markedly, because it parses well-formed sentences with more context than just the query words. 
On single-segment queries, it even surpasses the gray bar (seen in the chart below), which represents the performance of a parser that has access only to the query text but knows the correct query segments.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-08.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/queryparsing-tumblr-08.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="1038" data-orig-width="1486" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-08.png"&gt;&lt;img src="https://66.media.tumblr.com/2b0021c3fdd334f176e60f4df27e1c1e/tumblr_inline_p96f894mG01rgj0aw_540.png" alt="image" data-orig-height="1038" data-orig-width="1486" data-orig-src="https://s.yimg.com/ge/research/queryparsing-tumblr-08.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;br/&gt;&lt;p&gt;In sum, we introduced new tasks for the NLP community, together with a dataset for benchmarking, and proposed several algorithms to address these tasks. We still have a long way to go before reaching really good performance on the most difficult types of queries. 
We also invite the research community to obtain &lt;a href="http://webscope.sandbox.yahoo.com/catalog.php?datatype=l&amp;amp;did=79" target="_blank"&gt;the dataset&lt;/a&gt; from Webscope and try out other approaches that might do better!&lt;br/&gt;&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/145926804326</link><guid>https://yahooresearch.tumblr.com/post/145926804326</guid><pubDate>Tue, 14 Jun 2016 14:07:58 -0700</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>Yahoo Israel</category><category>NLP</category><category>Parsing</category><category>syntax</category><category>research</category><category>science</category><category>NAACL</category></item><item><title>EURO 2016 According to the Science of Tumblr</title><description>&lt;p&gt;&lt;a href="https://s.yimg.com/ge/default/691231/EURO2016_BracketFINAL.jpg" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/default/691231/EURO2016_BracketFINAL.jpg" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="900" data-orig-width="1441" data-orig-src="https://s.yimg.com/ge/default/691231/EURO2016_BracketFINAL.jpg"&gt;&lt;img src="https://66.media.tumblr.com/6a91df95fb1f0510a681556bc0472611/tumblr_inline_p8wvh66yVJ1rgj0aw_540.jpg" alt="image" data-orig-height="900" data-orig-width="1441" data-orig-src="https://s.yimg.com/ge/default/691231/EURO2016_BracketFINAL.jpg"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;By &lt;a href="https://research.yahoo.com/researchers/alireza" target="_blank"&gt;Ali Sahami&lt;/a&gt;, &lt;a href="http://research.yahoo.com/researchers/phui" target="_blank"&gt;Pik-Mai Hui&lt;/a&gt;, and &lt;a href="https://research.yahoo.com/researchers/mihajlo" target="_blank"&gt;Mihajlo Grbovic&lt;/a&gt;&lt;b&gt;&lt;br/&gt;&lt;/b&gt;&lt;/p&gt;&lt;p&gt;What better task to set a team of football-fanatic Yahoo researchers in the buildup to &lt;a href="http://www.uefa.com/uefaeuro/" target="_blank"&gt;EURO 2016&lt;/a&gt; than to 
predict the results of the tournament through a combination of big data, science, and social media? Charged with the challenge, we used our expertise and unique access to data from Tumblr – one of the world’s largest social media platforms – as well as Yahoo Sport, to draw conclusions on who will be crowned the champions. &lt;/p&gt;&lt;p&gt;We started by sorting through more than 20 billion posts added to over 24 million Tumblr blogs from January 1st through May 31st of this year. Popular hashtags such as #football, #soccer, #BongDa, #Euro2016 and #UEFA were used to filter through the posts and find the relevant content. We were left with 1.8 million relevant posts which we then scanned for mentions of countries and players.&lt;br/&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/test/EURO2016_TeamMentions.jpg" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/test/EURO2016_TeamMentions.jpg" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="3500" data-orig-width="6500" data-orig-src="https://s.yimg.com/ge/research/test/EURO2016_TeamMentions.jpg"&gt;&lt;img src="https://66.media.tumblr.com/ae2a140f090e6a0b51611221f697dda6/tumblr_inline_p8wvh7yB0g1rgj0aw_540.jpg" alt="image" data-orig-height="3500" data-orig-width="6500" data-orig-src="https://s.yimg.com/ge/research/test/EURO2016_TeamMentions.jpg"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/test/EURO2016_TeamPlayerMentions.jpg" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/test/EURO2016_TeamPlayerMentions.jpg" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="6000" data-orig-width="6000" data-orig-src="https://s.yimg.com/ge/research/test/EURO2016_TeamPlayerMentions.jpg"&gt;&lt;img src="https://66.media.tumblr.com/86df3b214eab5d28241a67972c0966b9/tumblr_inline_p8wvh9Uqrw1rgj0aw_540.jpg" alt="image" data-orig-height="6000" data-orig-width="6000" 
data-orig-src="https://s.yimg.com/ge/research/test/EURO2016_TeamPlayerMentions.jpg"/&gt;&lt;/figure&gt;&lt;/a&gt;Next, we used Opta data on Yahoo Sport to look back at stats from the past four years for all participating teams in this year’s tournament. We determined the average number of goals scored by each team over that time period, as well as the average number of goals that team’s opponents had scored against them. With all our data points determined, the fun (read: science-y) part came next.&lt;p&gt;&lt;/p&gt;&lt;p&gt;To figure out how the countries would stack up against one another, we needed to assign values of strength to each team. These values were calculated according to each matchup and provided a representative game score. &lt;/p&gt;&lt;p&gt;When two teams were positioned to play against each other, we estimated the number of goals each team would score using a &lt;a href="https://en.wikipedia.org/wiki/Poisson_regression" target="_blank"&gt;Poisson regression model&lt;/a&gt; with five variables for each team:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;The average number of goals a participating team’s opponents scored against it in the previous four years.&lt;/li&gt;&lt;li&gt;How many times a participating team was mentioned in EURO 2016-related Tumblr posts.&lt;/li&gt;&lt;li&gt;How many times on average a given participating team’s player was mentioned in EURO 2016-related Tumblr posts.&lt;/li&gt;&lt;li&gt;The standard deviation.&lt;/li&gt;&lt;li&gt;The average number of goals a participating team scored in the previous four years.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;a href="https://s.yimg.com/ge/research/test/EURO2016_Group.png" target="_blank"&gt;&lt;/a&gt;&lt;/p&gt;&lt;a href="https://s.yimg.com/ge/research/test/EURO2016_Group.png" target="_blank"&gt;&lt;figure class="tmblr-full" data-orig-height="765" data-orig-width="734" data-orig-src="https://s.yimg.com/ge/research/test/EURO2016_Group.png"&gt;&lt;img 
src="https://66.media.tumblr.com/5b7b7e44ff8c5e5baf1815e4c4a83762/tumblr_inline_p8wvhd3JWw1rgj0aw_540.jpg" alt="image" data-orig-height="765" data-orig-width="734" data-orig-src="https://s.yimg.com/ge/research/test/EURO2016_Group.png"/&gt;&lt;/figure&gt;&lt;/a&gt;&lt;p&gt;Finally, we were left with a statistical model predicting the outcome of each successive match-up based on our calculations. Taking into account our 1.8 million relevant posts over the course of the first five months of 2016, we had a complete bracket and a winner: team Germany.&lt;br/&gt;&lt;/p&gt;</description><link>https://yahooresearch.tumblr.com/post/145598623811</link><guid>https://yahooresearch.tumblr.com/post/145598623811</guid><pubDate>Tue, 07 Jun 2016 23:43:31 -0700</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>Tumblr</category><category>Yahoo Sport</category><category>Yahoo Sports</category><category>football</category><category>soccer</category><category>bongda</category><category>euro2016</category><category>uefa</category></item><item><title>Promoting a Culture of Learning with Research</title><description>&lt;figure class="tmblr-full" data-orig-height="3456" data-orig-width="5184"&gt;&lt;img src="https://66.media.tumblr.com/75227c093ea6b81a66bca3fd118c4d8e/tumblr_inline_o7swkyjbQG1rgj0aw_540.jpg" data-orig-height="3456" data-orig-width="5184"/&gt;&lt;/figure&gt;&lt;p&gt;At Yahoo, we encourage a culture of learning, both personally and professionally, internally and externally. Yahoo Research, in particular, often takes observations and shares them with the academic community. At the same time, we look to the academic community for revelations to share amongst Yahoos. It is in this open spirit we present a Big Thinkers talk with &lt;a href="http://people.ischool.berkeley.edu/~hearst/" target="_blank"&gt;Dr. 
Marti Hearst&lt;/a&gt;, a luminary in the fields of Natural Language Processing (NLP) and Search, covering fascinating new insights on learning in Massive Open Online Courses (MOOCs).&lt;br/&gt;&lt;/p&gt;&lt;p&gt;MOOCs have been widely touted as one of the most transformational and disruptive developments in the recent educational landscape. Whereas traditionally one had to physically attend a class, MOOCs have enabled students of all ages from around the world to have affordable access to high-quality courses and instructional materials that five years ago would not have been possible. This technology has thus opened up no shortage of research questions on how we adapt to learning in online settings with remote instructors and peers. Can people learn as effectively in these settings as they do in a classroom, even without face-to-face contact? How can high standards of education be maintained in this modality when scaling from 100 to 1,000 or even over 10,000 users for a single course? MOOCs have the potential to drastically change how we learn and how our children will learn.&lt;/p&gt;&lt;p&gt;In this Big Thinkers talk at Yahoo, Dr. 
Hearst discusses her recent research into understanding the best educational practices for MOOCs and how innovative techniques such as peer feedback can improve the engagement and retention of learners in distributed settings.&lt;/p&gt;&lt;center&gt;&lt;iframe width="640" height="360" scrolling="no" frameborder="0" src="https://www.yahoo.com/news/big-thinker-marti-hearst-183249885.html?format=embed" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"&gt;&lt;/iframe&gt;&lt;/center&gt;</description><link>https://yahooresearch.tumblr.com/post/144821165296</link><guid>https://yahooresearch.tumblr.com/post/144821165296</guid><pubDate>Mon, 23 May 2016 13:19:55 -0700</pubDate><category>Yahoo</category><category>Yahoo Research</category><category>MOOCs</category><category>UC Berkeley</category><category>NLP</category><category>Search</category><category>learning</category></item></channel></rss>
