Category Archives: Data Mining

Yahoo! Boss – An Insider View

Disclaimer: This is my personal blog. The views expressed on these pages are mine alone and not those of my employer.

Boss stands for Build your Own Search Service. The goal of Boss is to open up search to enable third parties to build incredibly useful and powerful search-based applications. Several months ago I pitched this idea to the executives on how Yahoo! can specifically open up its search assets to fragment the market. It’s remarkable to finally see some of the vision (with the help of many talented people) reach the public today.

Web search is a tough business to get into. $300+ Million capex, amazing talent, infrastructure, a prayer, etc. just to get close to basic parity. Only 3 companies have really pulled it off. However, I strongly believe we need to find innovative, incremental ways to spread the search love in order to encourage fragmentation and help promising companies get to basic parity instantly so that they can leverage their unique assets (new algorithm, user data, talent) to push their search solution beyond the current baseline.

Search is all about understanding the user’s intent. If we can nail the intent, then search is pretty much a solved problem. However, the current model of a single search box for everything loses an intent focus as it aims to cater to all people and queries. Albeit, a single search box definitely makes our lives easier, but I have a hard time believing this is the *right* approach.

In my online experience, I typically visit a variety of sites: Techmeme, Digg, Techcrunch, eBay, Amazon, del.icio.us, etc. While on these pages, something almost always catches my eye, and so I proceed to the search box in my browser to find out more on the web. Why do we have this disconnected experience? I think it’s because these sites do not provide web-level comprehensiveness. It’s unfortunate, because the page that I’m on may have additional information about my intent (maybe I’m logged in so it has my user info, or it’s a techy shopping site).

The biggest goal of Boss is to help bootstrap sites like these to get comprehensiveness and basic ranking for free, as well as offer tools to re-rank, blend, and overlay the results in a way that revolutionizes the search experience.

When I’m on del.icio.us, why can’t I search in their box, get relevant del.icio.us results at the top, and also have web results backfill below? I think users should be confident that if they searched in a search box on any page in the whole wide web that they’ll get results that are just as good as Yahoo/Google and only better.

The first milestone of Boss is a simple one: Make available a clean search API that turns off the traditional restrictions so that developers can totally control presentation, re-rank results, run an unlimited number of queries, and blend in external content all without having to include any Yahoo! attribution in the resulting product(s). Want to build the example above or put news search results on a map – go for it!

Here’s a link to the API:

http://developer.yahoo.com/search/boss/

Also, check out the Boss Mashup Framework:

http://developer.yahoo.com/search/boss/mashup.html

The Boss Mashup Framework in my opinion makes the Boss Search API really useful. It lets developers use SQL like syntax for operating on heterogeneous web data sources. The idea came up as I was working on examples to showcase Boss, and realized the operations I was developing imperatively followed closely to declarative SQL like constructs. Since it’s a recent idea and implementation, there may be some bugs or weird designs lurking in there, but I strongly recommend playing around with it and viewing the examples included in the package. I’m biased of course but do think it’s a fun framework for remixing online data. One can rank web results by digg and youtube favorite counts, remove duplicates, and publish the results using a provided search results page template in less than 30 lines of code and without having to specify any parsing logic of the data sources/API’s as the framework can infer the structure and unify the data formats automatically in most cases.

The next couple of milestones for Boss I think are even more interesting and disruptive – server side services, monetization, blending ranking models, more features exposure, query classifiers, open source … so stay tuned.

46 Comments

Filed under Blog Stuff, Data Mining, Information Retrieval, Non-Technical-Read, Open, Search, Techmeme

Techmeme Leaderboard 2007 – More!

I’m an avid reader of Techmeme. Love the idea, UI, freshness, coverage, and most of all the quality of the articles.

When the Techmeme Leaderboard debuted earlier this month, lots of buzz circulated the blogosphere. Me, being a huge fan of partying on data, loved the concept, and wanted to take the analysis even further (Yuvi style, but with a search twist).

So yesterday I wrote up some code to crawl and analyze Techmeme articles over the whole year (Leaderboard shows the Top 50 sources for this month). I took a snapshot of Techmeme at 1:00PM every day between beginning January – end of September of 2007.

I computed basic statistics, like number of stories by author and source, as well as more involved measurements like the top word mentions of the year – in total and by category (used simple NLP to clean up the text and remove stopwords).

So, without further ado, here are the results:

Number of Stories by Author in 2007, Ranked
Number of Stories by Source in 2007, Ranked
Most Mentioned Words in 2007, Ranked
* words are stemmed
Most Mentioned Words, by Category, Trends in 2007, Ranked

Hope you guys find these results super interesting and useful.

1 Comment

Filed under Blog Stuff, Data Mining, Information Retrieval, NLP, Statistics, Techmeme, Trends