Monthly Archives: January 2007

Google Co-op just got del.icio.us!

Update: Sorry, the link is going up and down. It is still worth trying, but I will look for a more stable option when I have some free time.

This past week I decided to cook up a service (link in bold near the middle of this post) that I feel will greatly assist users in developing advanced Google Custom Search Engines (CSEs). I read through the Co-op discussion posts, digg/blog comments, reviews, emails, etc. and learned that many of our users are fascinated by the refinements feature – in particular, building search engines that produce results like this:

“linear regression” on my Machine Learning Search Engine

… but unfortunately, many do not know how to do this, and either do not understand the XML or do not want to hack it up. Additionally, I think it’s fair to say that many users interested in building advanced CSEs have already done similar site tagging/bookmarking through services like del.icio.us. del.icio.us really is great. Here are several reasons why people should (and do) use del.icio.us:

  • It’s simple and clean
  • You can multi-tag a site quickly (comma separated field; don’t have to keep reopening the bookmarklet like with Google’s)
  • You can create new tags on the fly (don’t choose the labels from a fixed drop-down like with Google’s)
  • The bookmarklet provides auto-complete tag suggestions; shows you the popular tags others have used for that current site
  • Can have bundles (two level tag hierarchies)
  • Can see who else has bookmarked the site (can also view their comments); builds a user community
  • Generates a public page serving all your bookmarks

Understandably, we received several requests to support del.icio.us bookmark importing. My part-time role with Google just ended last Friday, so, as a non-Googler, I decided to build this project. Initially, I was planning to write a simple service to convert del.icio.us bookmarks into CSE annotations – and that’s it – but realized, as I learned more about del.icio.us, that there were several additional features I could develop that would make our users’ lives even easier. Instead of just generating the annotations, I decided to generate the CSE contexts as well.

Ok, enough talk, here’s the final product:
http://basundi.com:8000/login.html

If you don’t have a del.icio.us account, and just want to see how it works, then shoot me an email (check the bottom of the Bio page) and I’ll send you a dummy account to play with (can’t publicize it or else people might spam it or change the password).

Here’s a quick feature list:

  • Can build a full search engine (like the machine learning one above) in two steps, without having to edit any XML, and in less than two minutes
  • Auto-generates the CSE annotations XML from your del.icio.us bookmarks and tags
  • Provides an option to auto-generate CSE annotations just for del.icio.us bookmarks that have a particular tag
  • Provides an option to auto-calculate each annotation’s boost score (log normalizes over the max # of Others per bookmark)
  • Provides an option to auto-expand links (appends a wildcard * to any links that point to a directory)
  • Auto-generates the CSE context XML
  • Auto-generates facet titles
  • Since there’s a four-facet by five-label restriction (that’s the max that one can fit in the refinements display on the search results page), I provide two options for automatic facet/refinement generation:
    • The first uses a machine learning algorithm to find the four most frequent disjoint 5-item-sets (based on the # of del.icio.us tag co-occurrences; it then does query-expansion over the tag sets to determine good facet titles)
    • The other option returns the user’s most popular del.icio.us bundles and corresponding tags
    • Any refinements that do not make it in the top 4 facets are dumped in a fifth facet in order of popularity. If you don’t understand this then don’t worry, you don’t need to! The point is all of this is automated for you (just use the default Cluster option). If you want control over which refinements/facets get displayed, then just choose Bundle.
  • Provides help documentation links at key steps
  • And best of all … You don’t need to understand the advanced options of Google CSE/Co-op to build an advanced CSE! This seriously does all the hard, tedious work for you!
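
To make the boost score and link expansion options above a bit more concrete, here is a rough Python sketch. It is not the service’s actual code. The post only says the score “log normalizes over the max # of Others”, so the exact formula (log1p of a bookmark’s Others count divided by log1p of the maximum), the “ends with a slash” test for directories, and the bookmark dictionary layout are all assumptions made for illustration. The element and attribute names (Annotations, Annotation, Label, about, score) follow my understanding of the CSE annotations format and should be checked against Google’s documentation.

```python
import math
import xml.etree.ElementTree as ET


def boost_score(others: int, max_others: int) -> float:
    """Log-normalize an 'Others' count into [0, 1] (assumed formula)."""
    if max_others <= 0:
        return 0.0
    return math.log1p(others) / math.log1p(max_others)


def expand_link(url: str) -> str:
    """Append '*' when the link looks like a directory.

    Assumption: a trailing slash is what marks a directory link.
    """
    return url + "*" if url.endswith("/") else url


def build_annotations(bookmarks) -> str:
    """Turn bookmarks (hypothetical dicts: url, tags, others) into XML."""
    max_others = max((b["others"] for b in bookmarks), default=0)
    root = ET.Element("Annotations")
    for b in bookmarks:
        score = boost_score(b["others"], max_others)
        annotation = ET.SubElement(
            root,
            "Annotation",
            about=expand_link(b["url"]),
            score=f"{score:.2f}",
        )
        # One Label per del.icio.us tag; real CSE label names may differ.
        for tag in b["tags"]:
            ET.SubElement(annotation, "Label", name=tag)
    return ET.tostring(root, encoding="unicode")


# Made-up sample data, only to show the shape of the input.
sample_bookmarks = [
    {"url": "http://example.org/ml/", "tags": ["ml", "papers"], "others": 99},
    {"url": "http://example.org/intro.html", "tags": ["ml"], "others": 9},
]

annotations_xml = build_annotations(sample_bookmarks)
print(annotations_xml)
```

The log keeps a handful of extremely popular bookmarks from drowning out everything else: a bookmark saved by 9 others still gets about half the boost of one saved by 99 others, rather than a tenth.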

In my opinion, there’s no question that this is the easiest way to make a fancy search engine. If I make any future examples, I’m using this – I can simply use del.icio.us, sign in to this service, and voilà, I have a search engine with facets and multi-label support.


Please note that this tool is neither officially endorsed by nor affiliated with Google or Yahoo! It was just something I wanted to work on for fun that I think will benefit many users (including myself). Also, send your feedback/issues/bugs to me or post them on this blog.


Filed under AI, Co-op, CS, CSE, Google, Machine Learning, Research, Tagging

SDSS Skyserver Traffic

This past summer I worked at MSR alongside Dr. Jim Gray on analyzing the Skyserver’s (the online worldwide telescope portal) web and SQL logs. We just published our findings, which you can access here (MSR) or here (updated).

Still needs some clean-up (spelling, grammar, flow) and additional sections to tie up some loose ends, but it’s definitely presentable. Would love to hear what you guys think about the results (besides how pretty the graphs look :).


Filed under CS, Databases, Education, Publications, Research, Science