Just provide a link to a public LinkedIn profile and an email address and that’s it. The system will go find other folks on LinkedIn who best match that given profile and email back a summary of the results.
It leverages some very useful IR techniques along with a basic machine learned model to optimize the matching quality.
Some use cases:
If I provide a link to a star engineer, I can find a bunch of folks like that person to go try to recruit. One could also use LinkedIn / Google search to find people, but sometimes it can be difficult to formulate the right query and may be easier to just pivot off an ideal candidate.
I recently shared it with a colleague of mine who just graduated from college. He really wants to join a startup but doesn’t know of any (he just knows about the big companies like Microsoft, Google, Yahoo!, etc.). With this tool he found people who shared similar backgrounds and saw which small companies they work at.
Generally browsing the people graph based on credentials as opposed to relationships. It seems to be a fun way to find like-minded people around the world and see where they ended up. I’ve recently been using it to find advisors and customers based on folks I admire.
Anyways, just a fun application I developed on the side. It’s not perfect by any means but I figured it’s worth sharing.
It’s pretty compute intensive, so if you want to try it send mail to [contact at pplmatch dot com] to get your email address added to the list. Also, do make sure that the profiles you supply expose lots of text publicly – the more text the better the results.
My colleagues and I will be giving a talk on BOSS at Yahoo!’s Hack Day in NYC on October 9. To show developers the versatility of an open search API, I developed a simple toy example (see my past ones: TweetNews, Q&A) on the flight over that uses BOSS to generate data for training a machine learned text classifier. The resulting application basically takes two tags, some text, and tells you which tag best classifies that text. For example, you can ask the system if some piece of text is more liberal or conservative.
How does it work? BOSS offers delicious metadata for many search results that have been saved in delicious. This includes top tags, their frequencies, and the number of user saves. Additionally, BOSS makes available an option to retrieve extended search result abstracts. So, to generate a training set, I first build up a query list (100 delicious popular tags), search each query through BOSS (asking for 500 results per), and filter the results to just those that have delicious tags.
Basically, the collection logically looks like this:
To build a model comparing 2 tags, the system selects pairs from the above collection that have matching tags, converts the abstract + title text into features, and then passes the resulting pairs over to LibSVM to train a binary classification model.
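To make the feature conversion step concrete, here is a minimal sketch that turns title + abstract text into bag-of-words counts and writes them in LibSVM's sparse text format (`label index:value ...`). The function names, the tokenizer, and the growing vocabulary are assumptions for illustration, not the code from the original scripts.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def to_libsvm_line(label, text, vocab):
    """Encode text as one line of LibSVM's sparse format: 'label idx:count ...'.

    vocab maps token -> 1-based feature index and grows as new tokens appear.
    """
    counts = Counter(tokenize(text))
    features = {}
    for token, count in counts.items():
        index = vocab.setdefault(token, len(vocab) + 1)
        features[index] = count
    # LibSVM expects feature indices in ascending order.
    pairs = " ".join(f"{i}:{c}" for i, c in sorted(features.items()))
    return f"{label} {pairs}"

# Hypothetical examples: +1 for the first tag, -1 for the second.
vocab = {}
line_a = to_libsvm_line(+1, "Liberal blog on liberal policy", vocab)
line_b = to_libsvm_line(-1, "Conservative policy blog", vocab)
```

Raw counts are the simplest possible features; the caveats further down mention TFIDF scaling and normalization as obvious improvements.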
Here’s how it works:
tagger viksi$ python gen_training_test_set.py liberal conservative
gen_training_test_set finds the pairs with matching tags and splits those results into a training set (80% of the pairs) and a test set (20%), saving the data as training_data.txt and test_data.txt respectively. autosvm learns the best model (brute forcing the parameters for you – could be handy by itself as a general learning tool) and then applies it to the test set, reporting how well it did. In the above case, the system achieved 80% accuracy over 20 test instances.
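The 80/20 split can be sketched in a few lines. This is a hypothetical helper, assuming a simple shuffle with a fixed seed; the original script may order or sample the pairs differently.

```python
import random

def split_train_test(examples, train_fraction=0.8, seed=0):
    """Shuffle examples deterministically and split them into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-ins for the labeled (tag, text) pairs.
examples = [f"example-{i}" for i in range(100)]
train, test = split_train_test(examples)
```

Fixing the seed keeps the split reproducible, which helps when comparing parameter settings against the same test set.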
Here’s another way to use it:
tagger viksi$ python classify.py apple microsoft bill gates steve ballmer windows vista xp
tagger viksi$ python classify.py apple microsoft steve jobs ipod iphone macbook
classify combines the above steps into an application that, given two tags and some text, will return which tag more likely describes the text. Or, in command line form, ‘python classify.py [tag1] [tag2] [some free text]’ => ‘tag1’ or ‘tag2’
My main goal here is not to build a perfect experiment or classifier (see caveats below), but to show a proof of concept of how BOSS or open search can be leveraged to build intelligent applications. BOSS isn’t just a search API, but really a general data API for powering any application that needs to party on a lot of the world’s knowledge.
Although the code totals only ~200 lines, the system is fairly state-of-the-art as it employs LibSVM for its learning model. However, this classifier setup has several caveats due to my time constraints and goals, as my main intention for this example was to show the awesomeness of the BOSS data. For example, training and testing on abstracts and titles means the top features will probably include the query terms, so the test set may be fairly easy to score well on and may not be representative of real input data. I did later add code to remove query related features from the test set and the accuracy seemed to dip just slightly. For classify.py, the ‘some free text’ input needs to be fairly large (about an extended abstract’s size) to be more accurate. Another caveat is what happens when both tags have been used to label a particular search result. The current system may only choose one tag, which may incur an error depending on what’s selected in the test set. Furthermore, the features I’m using are super simple and can be greatly improved with TFIDF scaling, normalization, feature selection (mutual information gain), etc. Also, the evaluation should use more training / test instances (with a check on the distribution of the labels), baselines, and additional evaluation measures.
I could have made this code a lot cleaner and shorter if I just used LibSVM’s python interface, but I for some reason forgot about that and wrote up scripts that parsed the stdout messages of the binaries to get something working fast (but dirty).
Updated: sphinx setup wasn’t exactly ‘out of the box’. Sphinx searches the fastest now and its relevancy increased (charts updated below).
Later this month we will be presenting a half day tutorial on Open Search at SIGIR. It’ll basically focus on how to use open source software and cloud services for building and quickly prototyping advanced search applications. Open Search isn’t just about building a Google-like search box on a free technology stack, but encouraging the community to extend and embrace search technology to improve the relevance of any application.
For example, one non-search application of BOSS leveraged the Spelling service to spell correct video comments before handing them off to their Spam filter. The Spelling correction process normalizes popular words that spammers intentionally misspell to get around spam models that rely on term statistics, and thus, can increase spam detection accuracy.
We have split up our upcoming talk into two sections:
Services: Open Search Web APIs (Yahoo! BOSS, Twitter, Bing, and Google AJAX Search), interesting mashup examples, ranking models and academic research that leverage or could benefit from such services.
Software: How to use popular open source packages for vertical indexing your own data.
While researching for the Software section, I was quite surprised by the number of open source vertical search solutions I found:
And I was even more surprised by the lack of comparisons between these solutions. Many of these platforms advertise their performance benchmarks, but they are in isolation, use different data sets, and seem to be more focused on speed as opposed to say relevance.
The best paper I could find that compared performance and relevance of many open source search engines was Middleton+Baeza’07, but the paper is quite old now and didn’t make its source code and data sets publicly available.
So, I developed a couple of fun, off the wall experiments to test some of the popular vertical indexing solutions. These are meant for building code examples: this is just a simple/quick evaluation and not for SIGIR (read the disclaimer in the conclusion section). Here’s a table of the platforms I selected to study, with some high level feature breakdowns:
One key design decision I made was not to change any numerical tuning parameters. I really wanted to test “Out of the Box” performance to simulate the common developer scenario. Plus, it takes forever to optimize parameters fairly across multiple platforms and different data sets esp. for an over-the-weekend benchmark (see disclaimer in the Conclusion section).
Also, I tried my best to write each experiment natively for each platform using the expected library routines or binary commands.
For the first experiment, I wanted to see how well these platforms index Twitter data. Twitter is becoming very mainstream, and its real time nature and brevity differ greatly from traditional web content (which these search platforms are overall more tailored for) so its data should make for some interesting experiments.
So I proceeded to crawl Twitter to generate a sample data set. After about a full day and night, I had downloaded ~1M tweets (~10/second).
But before indexing, I did some quick analysis of my acquired Twitter data set:
# of Tweets: 968,937
Indexable Text Size (user, name, text message): 92MB
Average Tweet Size: 12 words
Types of Tweets based on simple word filters:
Very interesting stats here – especially the high percentage of tweets that seem to be asking questions. Could Twitter (or an application) better serve this need?
Here’s a table comparing the indexing performance over this Twitter data set across the select vertical search solutions:
Lucene was the only solution that produced an index that was smaller than the input data size. It shaves off an additional 5 megabytes if one runs it in optimize mode, but at the cost of adding another ten seconds to indexing. sphinx and zettair index the fastest. Interestingly, I ran zettair in big-and-fast mode (which sucks up 300+ megabytes of RAM) but it ran slower by 3 seconds (maybe because of the nature of tweets). Xapian ran 5x slower than sqlite (which stores the raw input data in addition to the index) and produced the largest index file sizes. The default index_text method in Xapian stores positional information, which blew the index size up to 529 megabytes. One must use index_text_without_positions to make the size more reasonable. I checked my Xapian code against the examples and documentation to see if I was doing something wrong, but I couldn’t find any discrepancies.
I also included a column about development issues I encountered. zettair was by far the easiest to use (simple command line) but required transforming the input data into a new format. I had some text issues with sqlite (which also needs to be recompiled with FTS3 enabled) and sphinx given their strict input constraints. sphinx also requires a conf file, and it took some searching to find full examples of one. Lucene, zettair, and Xapian were the most forgiving when it came to accepting text inputs (zero errors).
Measuring Relevancy: Medical Data Set
While this is a fun performance experiment for indexing short text, this test does not measure search performance and relevancy.
To measure relevancy, we need judgment data that tells us how relevant a document result is to a query. The best data set I could find that was publicly available for download (almost all of them require mailing in CDs) was from the TREC-9 Filtering track, which provides a collection of 196,403 medical journal references – totaling ~300MB of indexable text (titles, authors, abstracts, keywords) with an average of 215 tokens per record. More importantly, this data set provides judgment data for 63 query-like tasks in the form of “<task, document, 2|1|0 rating>” (2 is very relevant, 1 is somewhat relevant, 0 is not rated). An example task is “37 yr old man with sickle cell disease.” To turn this into a search benchmark, I treat these tasks as OR’ed queries. To measure relevancy, I compute the Average DCG across the 63 queries for results in positions 1-10.
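For readers unfamiliar with the metric, here is a minimal sketch of Average DCG over the top 10 positions. It assumes the common formulation where the rating at position i is discounted by log2(i + 1); the benchmark code may use a different DCG variant, and the function names and sample ratings are made up for illustration.

```python
import math

def dcg_at_k(ratings, k=10):
    """DCG over the first k results, with ratings listed in ranked order."""
    return sum(r / math.log2(pos + 1)
               for pos, r in enumerate(ratings[:k], start=1))

def average_dcg(ratings_by_query, k=10):
    """Mean DCG@k across all queries."""
    scores = [dcg_at_k(ratings, k) for ratings in ratings_by_query.values()]
    return sum(scores) / len(scores)

# Hypothetical judgments (2 = very relevant, 1 = somewhat, 0 = not rated)
# for the results each query returned, in ranked order.
ratings_by_query = {
    "q1": [2, 0, 1],
    "q2": [0, 0, 0],
}
avg = average_dcg(ratings_by_query)
```

Because unrated documents count as 0, an engine that returns relevant but unjudged documents is penalized, which is worth keeping in mind when reading the relevance numbers.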
With this larger data set (3x larger than the Twitter one), we see zettair’s indexing performance improve (makes sense as it’s more designed for larger corpora); zettair’s search speed should probably be a bit faster because its search command line utility prints some unnecessary stats. For multi-searching in sphinx, I developed a Java client (with the hopes of making it competitive with Lucene – the one to beat) which connects to the sphinx searchd server via a socket (that’s their API model in the examples). sphinx returned searches the fastest – ~3x faster than Lucene. Its indexing time was also on par with zettair. Lucene obtained the highest relevance and smallest index size. The index time could probably be improved by fiddling with its merge parameters, but I wanted to avoid numerical adjustments in this evaluation. Xapian has very similar search performance to Lucene but with significant indexing costs (both time and space > 3x). sqlite has the worst relevance because it doesn’t sort by relevance nor seem to provide an ORDER BY function to do so.
Conclusion & Downloads
Based on these preliminary results and anecdotal information I’ve collected from the web and people in the field (with more emphasis on the latter), I would probably recommend Lucene for many vertical search indexing applications, especially if you need community support and something that runs decently well out of the box (as that’s what I’m mainly evaluating here). Note that Lucene is an IR library, so use a wrapper platform like Solr w/ Nutch if you need all the search dressings like snippets, crawlers, and servlets.
Keep in mind that these experiments are still very early (done on a weekend budget) and can/should be improved greatly with bigger and better data sets, tuned implementations, and community support (I’d be the first one to say these are far from perfect, so I open sourced my code below). It’s pretty hard to make a benchmark that everybody likes (especially in this space where there haven’t really been many … and I’m starting to see why :)), not necessarily because there are always winners/losers and biases in benchmarks, but because there are so many different types of data sets and platform APIs and tuning parameters (at least databases support SQL!). This is just a start. I see this as a very evolutionary project that requires community support to get it right. Take the results here for what they’re worth and still run your own tuned benchmarks.
To encourage further search development and benchmarks, I’ve open sourced all the code here:
Update: (1/19) If you have issues try again in 5-10 minutes. You can also check out the screenshots below. (1/15) App Engine limits were reached (and fast). Appreciate the love and my apologies for not fully anticipating that. Google was nice enough though to temporarily raise the quota for this application. Anyways, this was more to show a cool BOSS developer example using code libraries I released earlier, but there might be more here. Stay tuned.
Here’s a screenshot as well (which should hopefully be stale by the time you read this).
Basically this service boosts Yahoo’s freshest news search results (which typically don’t have much relevance since they are ordered by timestamp and that’s it) based on how similar they are to the emerging topics found on Twitter for the same query (hence using Twitter to determine authority for content that doesn’t yet have links because it is so fresh). It also overlays related tweets via an AJAX expando button (big thanks to Greg Walloch at Yahoo! for the design) under results if they exist. A nice added feature to the overlay functionality is near-duplicate removal to ensure message threads on any given result provide as much comment diversity as possible.
Freshness (especially in the context of search) is a challenging problem. Traditional PageRank style algorithms don’t really work here as it takes time for a fresh URL to garner enough links to beat an older high ranking URL. One approach is to use cluster sizes as a feature for measuring the popularity of a story (i.e. Google News). Although quite effective IMO, this may not be fast enough all the time. For the cluster size to grow, other sources have to write about the same story. Traditional media can be slow however, especially on local topics. I remember when I saw breaking Twitter messages describing the California Wildfires. When I searched Google/Yahoo/Microsoft right at that moment I barely got anything (< 5 results spanning 3 search results pages). I had a similar episode when I searched on the Mumbai attacks. Specifically, the Twitter messages were providing incredible focus on the important subtopics that had yet to become popular in the traditional media and news search worlds. What I found most interesting in both of these cases was that news articles did exist on these topics, but either weren’t valued highly enough yet or weren’t focusing on the right stories (as the majority of tweets were). So why not just do that? Order these fresh news articles (which mostly provide authority and in-depth coverage) based on the number of related fresh tweets as well as show the tweets under each. That’s this service.
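The core ranking idea fits in a few lines. The sketch below is only an illustration: it assumes word-overlap (Jaccard) similarity between an article title and each tweet, plus an arbitrary threshold, neither of which is spelled out above. All function names and sample data are hypothetical.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_by_tweets(articles, tweets, threshold=0.2):
    """Order article titles by how many tweets are similar to them."""
    tweet_tokens = [tokens(t) for t in tweets]
    scored = []
    for title in articles:
        title_tokens = tokens(title)
        related = [t for t, tt in zip(tweets, tweet_tokens)
                   if jaccard(title_tokens, tt) >= threshold]
        scored.append((title, related))
    # sorted() is stable, so ties keep their original (freshest-first) order.
    return sorted(scored, key=lambda pair: len(pair[1]), reverse=True)

articles = ["College basketball roundup",
            "Kings beat Warriors in triple overtime"]
tweets = ["Kings Warriors triple overtime wow",
          "what a game Kings beat Warriors",
          "lunch time"]
ranked = rank_by_tweets(articles, tweets)
```

Each result carries its list of related tweets, so the same pass yields both the ordering and the tweets to show under each article.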
To illustrate the need, here’s a quick before and after shot. I searched for ‘nba’ using Yahoo’s news search ordered by latest results (first image). Very fresh (within a minute) but subpar quality. The first result talks about teams that are in a different league of basketball than the NBA. However, search for ‘nba’ on TweetNews (second image) and you get the Kings/Warriors triple OT game highlight which was buzzing more in Twitter at that minute.
There’s something very interesting here … Twitter as a ranking signal for search freshness may prove to be very useful if constructed properly. Definitely deserves more exploration – hence this service, which took < 100 lines of code to represent all the search logic thanks to Yahoo! BOSS, Twitter’s API, and the BOSS Mashup Framework.
To sum up, the contributions of this service are: (1) Real-time search + freshness (2) Stitching social commentary to authoritative sources of information (3) Another (hopefully cool) BOSS example.
The code is packaged for general open consumption and has been ported to run on App Engine (which powers this service actually). You can download all the source here.
Today I finally plugged the Yahoo Boss Mashup Framework into the Google App Engine environment. Google App Engine (GAE) provides a pretty sweet yet simple platform for executing Python applications on Google’s infrastructure. The Boss Mashup Framework (BMF) provides Python APIs for accessing Yahoo’s Search APIs as well as remixing data a la SQL constructs. Running BMF on top of GAE is a seemingly natural progression, and quite arguably the easiest way to deploy Boss – so I spent today porting BMF to the GAE platform.
There’s a README file included. Just unzip, put your appid’s in the config files, and you’re done. No setup or dependencies (easier than installing BMF standalone!). It’s a complete GAE project directory which includes a directory called yos which holds all the ported BMF code. Also made a number of improvements to the BMF code (SQL ‘where’ support, stopwords, yql.db refactoring, util & templates in yos namespace, yos.crawl.rest refactored & optimized, etc.).
The next natural thing to do is to develop a test application on top of this united framework. In the original BMF package, there’s an examples directory. In particular, ex6.py was able to answer some ‘when’ style questions. I simply wrapped that code as a function and referenced it as a GAE handler in main.py.
Keep in mind that this is just a quick proof of concept to hopefully showcase the power of BMF and the idea of Open Web Search.
If you’re interested in learning more about this Q&A system (or how to improve it), check out AskMSR – the original inspiration behind this example.
Also, shoutout to Sam for his very popular Yuil example, which is powered by BMF + GAE. The project download linked above aims to make it easier for people to build these types of web services.
Update: Sorry, link is going up and down. Worth trying, but will try to find a more stable option when time cycles free up.
This past week I decided to cook up a service (link in bold near the middle of this post) I feel will greatly assist users in developing advanced Google Custom Search Engines (CSE’s). I read through the Co-op discussion posts, digg/blog comments, reviews, emails, etc. and learned many of our users are fascinated by the refinements feature – in particular, building search engines that produce results like this:
… but unfortunately, many do not know how to do this nor understand/want to hack up the XML. Additionally, I think it’s fair to say many users interested in building advanced CSE’s have already done similar site tagging/bookmarking through services like del.icio.us. del.icio.us really is great. Here are a couple of reasons why people should (and do) use del.icio.us:
It’s simple and clean
You can multi-tag a site quickly (comma separated field; don’t have to keep reopening the bookmarklet like with Google’s)
You can create new tags on the fly (don’t choose the labels from a fixed drop-down like with Google’s)
The bookmarklet provides auto-complete tag suggestions; shows you the popular tags others have used for that current site
Can have bundles (two level tag hierarchies)
Can see who else has bookmarked the site (can also view their comments); builds a user community
Generates a public page serving all your bookmarks
Understandably, we received several requests to support del.icio.us bookmark importing. My part-time role with Google just ended last Friday, so, as a non-Googler, I decided to build this project. Initially, I was planning to write a simple service to convert del.icio.us bookmarks into CSE annotations – and that’s it – but realized, as I learned more about del.icio.us, that there were several additional features I could develop that would make our users’ lives even easier. Instead of just generating the annotations, I decided to also generate the CSE contexts as well.
If you don’t have a del.icio.us account, and just want to see how it works, then shoot me an email (check the bottom of the Bio page) and I’ll send you a dummy account to play with (can’t publicize it or else people might spam it or change the password).
Here’s a quick feature list:
Can build a full search engine (like the machine learning one above) in two steps, without having to edit any XML, and in less than two minutes
Auto-generates the CSE annotations XML from your del.icio.us bookmarks and tags
Provides an option to auto-generate CSE annotations just for del.icio.us bookmarks that have a particular tag
Provides an option to Auto-calculate each annotation’s boost score (log normalizes over the max # of Others per bookmark)
Provides an option to Auto-expand links (appends a wildcard * to any links that point to a directory)
Auto-generates the CSE context XML
Auto-generates facet titles
Since there’s a four facet by five labels restriction (that’s the max that one can fit in the refinements display on the search results page), I provide two options for automatic facet/refinement generation:
The first uses a machine learning algorithm to find the four most frequent disjoint 5-item-sets (based on the # of del.icio.us tag co-occurrences; it then does query-expansion over the tag sets to determine good facet titles)
The other option returns the user’s most popular del.icio.us bundles and corresponding tags
Any refinements that do not make it in the top 4 facets are dumped in a fifth facet in order of popularity. If you don’t understand this then don’t worry, you don’t need to! The point is all of this is automated for you (just use the default Cluster option). If you want control over which refinements/facets get displayed, then just choose Bundle.
Provides help documentation links at key steps
And best of all … You don’t need to understand the advanced options of Google CSE/Co-op to build an advanced CSE! This seriously does all the hard, tedious work for you!
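Two of the options in the list above, the boost score and link expansion, are easy to sketch. The exact formula is not given here, so the version below is an assumption: it divides log(1 + Others) by log(1 + max Others) to get a score between 0 and 1, and treats any link ending in a slash as a directory. The function names and URLs are hypothetical.

```python
import math

def boost_scores(others_by_url):
    """Scale each bookmark's 'Others' count into [0, 1] on a log scale."""
    max_others = max(others_by_url.values(), default=0)
    if max_others == 0:
        return {url: 0.0 for url in others_by_url}
    denom = math.log(1 + max_others)
    return {url: math.log(1 + n) / denom for url, n in others_by_url.items()}

def expand_link(url):
    """Append a wildcard to links that point at a directory."""
    return url + "*" if url.endswith("/") else url

# Hypothetical bookmarks with the number of other users who saved them.
others = {
    "http://example.com/ml/": 99,
    "http://example.com/paper.html": 9,
    "http://example.com/notes.html": 0,
}
scores = boost_scores(others)
links = [expand_link(url) for url in others]
```

The log scale keeps one hugely popular bookmark from flattening every other score toward zero.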
In my opinion, there’s no question that this is the easiest way to make a fancy search engine. If I make any future examples I’m using this – I can simply use del.icio.us, sign-in to this service, and voila I have a search engine with facets and multi-label support.
Please note that this tool is not officially endorsed by nor affiliated with Google or Yahoo! It was just something I wanted to work on for fun that I think will benefit many users (including myself). Also, send your feedback/issues/bugs to me or post them on this blog.
This past summer I worked at MSR alongside Dr. Jim Gray on analyzing the web and SQL logs of the SkyServer (the online worldwide telescope portal). We just published our findings, which you can access here (MSR) or here (updated).
Still needs some clean-up (spelling, grammar, flow) and additional sections to tie up some loose ends, but it’s definitely presentable. Would love to hear what you guys think about the results (besides how pretty the graphs look :).
One of the projects Jim Gray and I worked on this summer was classifying the types of SQL queries users ask on the SkyServer site ( http://cas.sdss.org/dr5/en/ ). We were surprised that we could not find any existing research describing methods for breaking down SQL for categorization – especially considering the number of websites and database workloads that keep query logs. Below is a link to the powerpoint presentation I gave at MSR Mountain View last week which describes how we analyzed the SQL. Notable features include text processing strategies, clustering algorithms, distance functions, and two example applications (Bot detection and Query recommendation). We plan to publish our algorithms and results in a technical report in the next month or so – but for now, enjoy the .ppt. As always, comments are more than welcome.
Below is a small class talk I gave on the hierarchical multi-labeling classification framework I outlined in my previous ‘Future of Tagging’ posts. I did a small experiment classifying tech news articles as Pro/Anti- Microsoft/Google (along with some other tags like the tech category and whether the article is a blog or publication based off the text of the piece). The results are very promising – even with such a small corpus of training documents the classifier performed very well. I do have some ideas on how to further improve accuracy, so when time cycles free up I’ll add those to the code and rerun it on a bigger and better (in terms of tag structure) training data set. By then I’m hoping the code will look less embarrassing for public release and the results will be more conclusive – but until that day here are the presentation slides:
Just wanted to let people know that I’ve changed my algorithms/framework for hierarchical multi-labeling classification quite a bit. One thing that really bugged me about my initial idea was the error correction scheme – i.e. sampling the tag network (a Bayesian/MRF hybrid) for closely related bitstrings. All the SAT/conditional probability table values in this network are generated from the number of times tags occur together in the training data, thus making my error correction scheme a popularity contest. But what about the feature values? We SHOULD take these values into account and try to reduce our new input down to a training data example with closely related feature values THAT also happens to have a similar tag bitstring (based off the prediction string outputted by the binary classifiers).
With regards to assuming there are k errors in the bitstring (call it b) we get back from the classifiers: previously, we sampled new candidate bitstrings based on the bit pattern produced after randomly unsetting k bits in b. Instead, since many classifiers (like the support vector machine I’m using) can return a probability confidence associated with the 0/1 output, my new algorithm chooses the k positions to unset not uniformly at random, but rather with a bias towards the bits with the smallest probabilities (since they are most likely the erroneous ones according to the classifiers).
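A minimal sketch of the biased selection, under stated assumptions: each position is drawn without replacement with weight (1 - confidence), and an unset bit is marked as None. The weighting scheme, function names, and sample values are illustrative only and not necessarily what my code does.

```python
import random

def pick_bits_to_unset(confidences, k, rng=None):
    """Pick k distinct positions, favoring low-confidence bits.

    Each position is drawn with weight (1 - confidence), without replacement.
    """
    rng = rng or random.Random()
    remaining = list(range(len(confidences)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        # The tiny epsilon keeps the weights valid if every confidence is 1.0.
        weights = [1.0 - confidences[i] + 1e-9 for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen)

def unset_bits(bits, positions):
    """Replace the chosen positions with None to mark them as unknown."""
    return [None if i in positions else b for i, b in enumerate(bits)]

# Hypothetical classifier outputs and their probability confidences.
bits = [1, 0, 1, 1]
confidences = [0.99, 0.55, 0.97, 0.60]
positions = pick_bits_to_unset(confidences, k=2, rng=random.Random(42))
masked = unset_bits(bits, positions)
```

With these numbers, positions 1 and 3 are by far the most likely picks, but the confident bits can still be chosen occasionally, which keeps the sampler from being fully greedy.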
Another thing I added was a pair of tag normalization rules for determining how to choose labels:
No more than one tag from each tree/hierarchy
Each tag must be a leaf node in a tree
Why the rules? They provide some level of control over the placement and generality of the tags. The first one ensures there’s some separation/disjointness among the tags. And for the second – I was afraid of mixing general and very specific tags together in a grouping because it could hurt my learner’s accuracy (since the tags/labels are not on a par with each other). By forcing tags to be leaf nodes in the trees we sort of normalize the labels to be on the same weighted level of specificity.
Another note – when generating the tag binary classifiers, I originally proposed just taking all the files/features that map to a label grouping that contains that tag (set as the y=1 cases in the binary classifier’s training data) and all the files/features that map to a grouping that does not contain the tag for the y=0 cases. However, splitting up the data this way seems likely to produce many bad/unnecessary features since (1) there can be a LOT of 0 cases and (2) 0 case files/examples can deal with ANYTHING, introducing completely irrelevant features into the tag’s binary classifier model. But we have a way out of this dilemma thanks to the tag normalization rules above – since we can only choose a single tag from each tree, we can use all the inputs/files/training data examples that map to other leaf-node tags in the SAME tree for the zero cases. This selection of 0 case files scopes the context down to the one label hierarchy/tree that contains the tag we’re trying to learn.
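Here is a rough sketch of that selection of 0 cases. The data layout (document ids paired with sets of leaf tags, plus a map from each leaf tag to its tree) and all the names are assumptions made for the example.

```python
def build_binary_training_set(examples, tag, tree_of):
    """Return (positives, negatives) for one tag's binary classifier.

    examples: list of (doc_id, set_of_leaf_tags)
    tree_of:  dict mapping each leaf tag to the name of its tree
    """
    # Other leaf tags that live in the same tree as the target tag.
    siblings = {t for t, tree in tree_of.items()
                if tree == tree_of[tag] and t != tag}
    positives = [doc for doc, tags in examples if tag in tags]
    # Only documents labeled with a sibling tag count as 0 cases.
    negatives = [doc for doc, tags in examples
                 if tag not in tags and tags & siblings]
    return positives, negatives

# Hypothetical tag trees and labeled documents.
tree_of = {
    "pro-microsoft": "stance",
    "anti-microsoft": "stance",
    "hardware": "category",
    "software": "category",
}
examples = [
    ("d1", {"pro-microsoft", "software"}),
    ("d2", {"anti-microsoft", "hardware"}),
    ("d3", {"hardware"}),
    ("d4", {"pro-microsoft"}),
]
positives, negatives = build_binary_training_set(examples, "pro-microsoft", tree_of)
```

Document d3 carries no tag from the stance tree, so it is left out of the negatives entirely instead of adding unrelated features to the model.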
Anyway, I’ll try to post the pseudo code (and actual code) for my algorithms and some initial experimental results on this blog shortly. Additionally, expect a tutorial describing the steps/software I used to perform these tests.