It’s totally unbearable and massively inefficient to process countless emails every day. And yet, to have any chance of success in today’s information world, you must communicate via email.
As you succeed, you become more networked, and more dependent on others to achieve even bigger milestones. As a result, your email volume just increases, while higher expectations require even faster responses and decision making. It’s a seemingly impossible cycle.
This is especially true for C-level and executive leaders. I was chatting recently with Suresh Khanna, Chief Revenue Officer at AdRoll, and he said it best: “Management is about making decisions – not executing. You need to delegate execution efficiently. You need to listen and keep everyone aligned on the same page.
“So, when it comes to doing this over email, you mainly serve as an email routing and forwarding agent.”
My colleagues and I will be giving a talk on BOSS at Yahoo!’s Hack Day in NYC on October 9. To show developers the versatility of an open search API, I developed a simple toy example (see my past ones: TweetNews, Q&A) on the flight over that uses BOSS to generate data for training a machine learned text classifier. The resulting application takes two tags and some text, and tells you which tag best classifies that text. For example, you can ask the system if some piece of text is more liberal or conservative.
How does it work? BOSS offers delicious metadata for many search results that have been saved in delicious. This includes top tags, their frequencies, and the number of user saves. Additionally, BOSS makes available an option to retrieve extended search result abstracts. So, to generate a training set, I first build up a query list (100 popular delicious tags), search each query through BOSS (asking for 500 results per query), and filter the results to just those that have delicious tags.
Basically, the collection logically looks like this:
To build a model comparing 2 tags, the system selects pairs from the above collection that have matching tags, converts the abstract + title text into features, and then passes the resulting pairs over to LibSVM to train a binary classification model.
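As a rough illustration of the "text into features" step, here is a minimal sketch that turns title + abstract text into LibSVM's sparse input format (`<label> <index>:<value> ...`, with 1-based, ascending feature indices). The tokenizer, the use of raw term counts, and the example texts are assumptions for illustration; the post does not show the actual feature extraction code.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(documents):
    """Assign each distinct token a 1-based feature index (LibSVM indices start at 1)."""
    vocab = {}
    for text in documents:
        for token in tokenize(text):
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab

def to_libsvm_line(label, text, vocab):
    """Format one example as '<label> <index>:<count> ...' with ascending indices."""
    counts = Counter(vocab[t] for t in tokenize(text) if t in vocab)
    features = " ".join(f"{idx}:{cnt}" for idx, cnt in sorted(counts.items()))
    return f"{label} {features}".rstrip()

# Hypothetical examples: 1 stands for the first tag, -1 for the second.
examples = [
    (1, "Health care reform and climate policy"),
    (-1, "Tax cuts and smaller government policy"),
]
vocab = build_vocabulary(text for _, text in examples)
lines = [to_libsvm_line(label, text, vocab) for label, text in examples]
for line in lines:
    print(line)
```

Each printed line is one training instance; writing those lines to a file gives something the LibSVM binaries can read.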
Here’s how it works:
tagger viksi$ python gen_training_test_set.py liberal conservative
gen_training_test_set finds the pairs with matching tags and splits those results into a training set (80% of the pairs) and a test set (20%), saving the data as training_data.txt and test_data.txt respectively. autosvm learns the best model (brute forcing the parameters for you – could be handy by itself as a general learning tool) and then applies it to the test set, reporting how well it did. In the above case, the system achieved 80% accuracy over 20 test instances.
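The 80/20 split described above can be sketched in a few lines. This is only an illustration: the function name, the fixed seed, and the placeholder examples are assumptions, and the real script may select or order the pairs differently.

```python
import random

def split_train_test(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and split it into (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cutoff = int(round(len(shuffled) * train_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]

# Hypothetical stand-ins for 100 labeled examples in LibSVM format.
examples = [f"example-{i}" for i in range(100)]
train, test = split_train_test(examples)
print(len(train), len(test))
```

Shuffling before the cut matters here because the search results arrive grouped by query, so an unshuffled split could put all examples of one tag in the test set.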
Here’s another way to use it:
tagger viksi$ python classify.py apple microsoft bill gates steve ballmer windows vista xp
tagger viksi$ python classify.py apple microsoft steve jobs ipod iphone macbook
classify combines the above steps into an application that, given two tags and some text, will return which tag more likely describes the text. Or, in command line form, ‘python classify.py [tag1] [tag2] [some free text]’ => ‘tag1’ or ‘tag2’
My main goal here is not to build a perfect experiment or classifier (see caveats below), but to show a proof of concept of how BOSS or open search can be leveraged to build intelligent applications. BOSS isn’t just a search API, but really a general data API for powering any application that needs to party on a lot of the world’s knowledge.
Although the code totals only ~200 lines, the system is fairly state-of-the-art as it employs LibSVM for its learning model. However, this classifier setup has several caveats due to my time constraints and goals, as my main intention for this example was to show the awesomeness of the BOSS data. For example, training and testing on abstracts and titles means the top features will probably include the query terms, so the test set may be fairly easy to score well on and may not be representative of real input data. I did later add code to remove query-related features from the test set, and the accuracy seemed to dip just slightly. For classify.py, the ‘some free text’ input needs to be fairly large (about an extended abstract’s size) to be classified more accurately. Another caveat is what happens when both tags have been used to label a particular search result. The current system may only choose one tag, which may incur an error depending on what’s selected in the test set. Furthermore, the features I’m using are super simple and can be greatly improved with TF-IDF scaling, normalization, feature selection (mutual information gain), etc. Finally, the system should be evaluated with more training and test instances (checking the distribution of the labels), baselines, and additional evaluation measures.
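To make the TF-IDF scaling and normalization suggestion concrete, here is a minimal sketch of one common variant (raw term count times log(N / document frequency), followed by L2 normalization). It is not part of the original code, and the token lists are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, TF-IDF weighted and L2-normalized."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights = {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors

# Hypothetical tokenized abstracts for two tags.
docs = [
    ["apple", "iphone", "macbook", "software"],
    ["microsoft", "windows", "vista", "software"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

A term that appears in every document ("software" here) gets a weight of zero, so it no longer influences the classifier, while terms specific to one document keep their weight.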
I could have made this code a lot cleaner and shorter if I just used LibSVM’s python interface, but I for some reason forgot about that and wrote up scripts that parsed the stdout messages of the binaries to get something working fast (but dirty).
Updated: sphinx setup wasn’t exactly ‘out of the box’. Sphinx searches the fastest now and its relevancy increased (charts updated below).
Later this month we will be presenting a half day tutorial on Open Search at SIGIR. It’ll basically focus on how to use open source software and cloud services for building and quickly prototyping advanced search applications. Open Search isn’t just about building a Google-like search box on a free technology stack, but encouraging the community to extend and embrace search technology to improve the relevance of any application.
For example, one non-search application of BOSS leveraged the Spelling service to spell correct video comments before handing them off to their Spam filter. The Spelling correction process normalizes popular words that spammers intentionally misspell to get around spam models that rely on term statistics, and thus, can increase spam detection accuracy.
We have split up our upcoming talk into two sections:
Services: Open Search Web APIs (Yahoo! BOSS, Twitter, Bing, and Google AJAX Search), interesting mashup examples, ranking models and academic research that leverage or could benefit from such services.
Software: How to use popular open source packages for vertical indexing your own data.
While researching for the Software section, I was quite surprised by the number of open source vertical search solutions I found:
And I was even more surprised by the lack of comparisons between these solutions. Many of these platforms advertise their performance benchmarks, but those benchmarks are run in isolation, use different data sets, and seem to be more focused on speed than on, say, relevance.
The best paper I could find that compared performance and relevance of many open source search engines was Middleton+Baeza’07, but the paper is quite old now and didn’t make its source code and data sets publicly available.
So, I developed a couple of fun, off the wall experiments to test some of the popular vertical indexing solutions. These were built to produce code examples; this is just a simple, quick evaluation and not for SIGIR (read the disclaimer in the Conclusion section). Here’s a table of the platforms I selected to study, with some high level feature breakdowns:
One key design decision I made was not to change any numerical tuning parameters. I really wanted to test “Out of the Box” performance to simulate the common developer scenario. Plus, it takes forever to optimize parameters fairly across multiple platforms and different data sets esp. for an over-the-weekend benchmark (see disclaimer in the Conclusion section).
Also, I tried my best to write each experiment natively for each platform using the expected library routines or binary commands.
For the first experiment, I wanted to see how well these platforms index Twitter data. Twitter is becoming very mainstream, and its real time nature and brevity differ greatly from traditional web content (which these search platforms are overall more tailored for), so its data should make for some interesting experiments.
So I proceeded to crawl Twitter to generate a sample data set. After about a full day and night, I had downloaded ~1M tweets (~10/second).
But before indexing, I did some quick analysis of my acquired Twitter data set:
# of Tweets: 968,937
Indexable Text Size (user, name, text message): 92MB
Average Tweet Size: 12 words
Types of Tweets based on simple word filters:
Very interesting stats here – especially the high percentage of tweets that seem to be asking questions. Could Twitter (or an application) better serve this need?
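For readers who want to reproduce this kind of breakdown, here is a minimal sketch of tagging tweets with simple word filters. The categories and patterns below are hypothetical; the post does not list the actual filters used.

```python
import re
from collections import Counter

# Hypothetical filters; the actual words and categories are not given in the post.
FILTERS = {
    "question": re.compile(r"\?|^(who|what|when|where|why|how)\b", re.IGNORECASE),
    "reply": re.compile(r"^@\w+"),
    "retweet": re.compile(r"^RT\b"),
    "link": re.compile(r"https?://"),
}

def tweet_types(text):
    """Return the set of filter names whose pattern matches the tweet."""
    stripped = text.strip()
    return {name for name, pattern in FILTERS.items() if pattern.search(stripped)}

def type_percentages(tweets):
    """Percentage of tweets matching each filter (a tweet may match several)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet_types(tweet))
    return {name: 100.0 * counts[name] / len(tweets) for name in FILTERS}

samples = [
    "Anyone know a good sushi place in NYC?",
    "@alice thanks for the tip",
    "RT @bob: new post http://example.com",
    "Just landed in SFO",
]
percentages = type_percentages(samples)
print(percentages)
```

Because a tweet can match more than one filter, the percentages need not sum to 100.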
Here’s a table comparing the indexing performance over this Twitter data set across the select vertical search solutions:
Lucene was the only solution that produced an index smaller than the input data. Running it in optimize mode shaves off another 5 megabytes, at the cost of adding ten seconds to indexing. sphinx and zettair index the fastest. Interestingly, I ran zettair in big-and-fast mode (which sucks up 300+ megabytes of RAM), but it ran 3 seconds slower (maybe because of the nature of tweets). Xapian ran 5x slower than sqlite (which stores the raw input data in addition to the index) and produced the largest index files. The default index_text method in Xapian stores positional information, which blew the index size up to 529 megabytes; one must use index_text_without_positions to make the size more reasonable. I checked my Xapian code against the examples and documentation to see if I was doing something wrong, but I couldn’t find any discrepancies.
I also included a column about the development issues I encountered. zettair was by far the easiest to use (simple command line) but required transforming the input data into a new format. I had some text issues with sqlite (which also needs to be recompiled with FTS3 enabled) and sphinx, given their strict input constraints. sphinx also requires a conf file, and it took some searching to find full examples of one. Lucene, zettair, and Xapian were the most forgiving when it came to accepting text inputs (zero errors).
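To see why storing positional information inflates an index, consider this toy inverted index. It is only a conceptual sketch and says nothing about Xapian's actual on-disk format; the function names and the "stored integers" size measure are assumptions for illustration.

```python
from collections import defaultdict

def build_index(docs, with_positions):
    """Map each term to {doc_id: [positions]} or, without positions, to a set of doc ids."""
    index = defaultdict(dict) if with_positions else defaultdict(set)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            if with_positions:
                index[term].setdefault(doc_id, []).append(pos)
            else:
                index[term].add(doc_id)
    return index

def stored_integers(index, with_positions):
    """Count doc ids (plus positions, if kept) as a rough proxy for index size."""
    if with_positions:
        return sum(1 + len(positions)
                   for postings in index.values()
                   for positions in postings.values())
    return sum(len(doc_ids) for doc_ids in index.values())

docs = ["to be or not to be", "to tweet or not to tweet"]
plain = stored_integers(build_index(docs, False), False)
positional = stored_integers(build_index(docs, True), True)
print(plain, positional)
```

A positional index stores one entry per term occurrence, which is what enables phrase and proximity queries, while a non-positional index stores only one entry per term and document.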
Measuring Relevancy: Medical Data Set
While this is a fun performance experiment for indexing short text, this test does not measure search performance and relevancy.
To measure relevancy, we need judgment data that tells us how relevant a document result is to a query. The best data set I could find that was publicly available for download (almost all of them require mailing in CD’s) was from the TREC-9 Filtering track, which provides a collection of 196,403 medical journal references – totaling ~300MB of indexable text (titles, authors, abstracts, keywords) with an average of 215 tokens per record. More importantly, this data set provides judgment data for 63 query-like tasks in the form of “<task, document, 2|1|0 rating>” (2 is very relevant, 1 is somewhat relevant, 0 is not rated). An example task is “37 yr old man with sickle cell disease.” To turn this into a search benchmark, I treat these tasks as OR’ed queries. To measure relevancy, I compute the Average DCG across the 63 queries for results in positions 1-10.
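For reference, here is a minimal sketch of an Average DCG computation over positions 1-10. It assumes the common rating / log2(rank + 1) discount, which may differ from the exact variant used in this benchmark, and the ratings shown are hypothetical.

```python
import math

def dcg_at_k(ratings, k=10):
    """DCG over the first k results: sum of rating / log2(rank + 1), with rank starting at 1."""
    return sum(r / math.log2(rank + 1)
               for rank, r in enumerate(ratings[:k], start=1))

def average_dcg(ranked_ratings_per_query, k=10):
    """Mean DCG@k across all queries."""
    scores = [dcg_at_k(ratings, k) for ratings in ranked_ratings_per_query]
    return sum(scores) / len(scores)

# Hypothetical judged results for two queries, in ranked order (2, 1, or 0 per result).
runs = [
    [2, 0, 1],  # very relevant at rank 1, somewhat relevant at rank 3
    [0, 0, 0],  # nothing rated in the top results
]
score = average_dcg(runs)
print(score)
```

Results without a judgment are treated as 0 here, matching the "0 is not rated" convention of the data set.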
With this larger data set (3x larger than the Twitter one), we see zettair’s indexing performance improve (which makes sense, as it’s designed more for larger corpora); zettair’s search speed should probably be a bit faster than reported, because its search command line utility prints some unnecessary stats. For multi-searching in sphinx, I developed a Java client (with the hope of making it competitive with Lucene, the one to beat) which connects to the sphinx searchd server via a socket (that’s the API model in their examples). sphinx returned searches the fastest, ~3x faster than Lucene. Its indexing time was also on par with zettair. Lucene obtained the highest relevance and smallest index size. Its index time could probably be improved by fiddling with its merge parameters, but I wanted to avoid numerical adjustments in this evaluation. Xapian has very similar search performance to Lucene but with significant indexing costs (more than 3x in both time and space). sqlite has the worst relevance because it neither sorts results by relevance nor seems to provide a function to ORDER BY for that purpose.
Conclusion & Downloads
Based on these preliminary results and anecdotal information I’ve collected from the web and people in the field (with more emphasis on the latter), I would probably recommend Lucene for many vertical search indexing applications, especially if you need community support and something that runs decently well out of the box (as that’s what I’m mainly evaluating here). Note that Lucene is an IR library; use a wrapper platform like Solr w/ Nutch if you need all the search dressings like snippets, crawlers, and servlets.
Keep in mind that these experiments are still very early (done on a weekend budget) and can/should be improved greatly with bigger and better data sets, tuned implementations, and community support (I’d be the first one to say these are far from perfect, so I open sourced my code below). It’s pretty hard to make a benchmark that everybody likes (especially in this space where there haven’t really been many … and I’m starting to see why :)), not necessarily because there are always winners/losers and biases in benchmarks, but because there are so many different types of data sets, platform APIs, and tuning parameters (at least databases support SQL!). This is just a start. I see this as a very evolutionary project that requires community support to get it right. Take the results here for what they’re worth and still run your own tuned benchmarks.
To encourage further search development and benchmarks, I’ve open sourced all the code here:
So what is it? It’s called Google Co-op, a platform which enables users to build their own vertical search engines and make money off the advertisements. It provides a clean, easy interface for simple site restrictions (like what Yahoo! Search Builder and Live Macros offer) plus a number of power user features for tweaking the search results. The user has control over the look and feel (to embed the search box on their own site), can rank results, and even (multi) tag sites to let viewers filter out results by category.
But talk is cheap. So let me show you some examples of what you can do with Co-op:
This is a technology specific search engine, which lets users refine results based on Google Topics (global labels which anyone can annotate with). Basically, I was lazy here. I didn’t feel like multi-tagging sites/domains individually, so instead I just collected a laundry list of popular technology site domains in a flat file and pasted it into Google Co-op’s Custom Search Engine control panel/sites page. In addition, and this is something I think is really useful, Google Co-op allows users to bulk upload links from OPML files. So, to make my life easier when building this, I uploaded Scoble’s and Matt Cutts’s OPML files. Tons of great links there (close to 1000 in total). Then I clicked on the ‘filter results to just the sites I listed’ option (which I recommend you use, since if you mix your results with normal Google web search results you typically won’t see your sites popping up on the first page, despite the promise of higher priority for hand chosen sites).
To enable the filters you see on the results page (Reviews, Forums, Shopping, Blogs, etc.), I did an intersection of the background label of my search engine and the Google Topics labels. How do you do that? The XML context configuration exposes a <BackgroundLabels> tag. Any labels listed in the BackgroundLabels block will be AND’ed (how cool is that). So I added the label of my search engine (each search engine has a unique background label, which can be found bolded on the Advanced Tab page) and a Google Topic label (News, Reviews, Stores, Shopping_Comparison, Blogs, Forums, etc.) in the BackgroundLabels XML block. I made a separate XML context file for each Google Topic intersection. By doing this, I didn’t have to tag any of my results and was still able to provide search filters. Google Topics does most of the hard work and gives me search refinements for free!
But say you’re not lazy. Here’s an example of what you can do with multi-tagging and refinements.
This one is more of a power user example – notice the refinements onebox on the search results page, and the labels with “>>” at the end. These labels redirect to another label hierarchy. This is a hack: I used the label redirect XML option to link to other custom search engine contexts, so basically I’m nesting search engines here.
Now, say you want to get fancy with the search results presentation. Here’s a way to do it with Google’s Ajax Search API:
While writing this blog post, I realized it would take me forever to go over the number of tricks one can pull with Co-op. Instead, I’ll summarize some of the big selling point features to encourage everyone to start hacking away. Also, to help jump start power users, I’ve linked the XML files I used to make my featured search examples at the bottom of this post.
Key Feature Summary (in no particular order):
Make money (get a share off the ad clicks)
Have up to 5000 annotations
Can collaborate with friends to tag sites (I’ve made my search engines public so anyone can add their annotations)
Can associate weights to results and have control over the rankings (refer to the <score> tag in the XML)
Completely brand the engine and customize the look and feel
Can combine results with another person’s search results (by intersecting the background labels in the XML advanced configuration file)
and much much more (especially for power users).
If you need a search engine for your site, and your content has been indexed by Google, then seriously consider using this rather than building your own index – or worse, using the crappy full-text functions available in relational databases.