Category Archives: Databases

What I learned working under Turing Award winner Jim Gray – 10 years since his disappearance

A few days ago, I sent out the following email remembering Jim to close friends and colleagues. I did not intend to share this broadly, but I received many positive replies encouraging me to post this publicly … so, here it is:

10 years ago this month, my mentor and idol Jim Gray disappeared at sea. I had the great fortune of working under him. We had published a paper together in the weeks leading up to his final sail.

I learned so much from Jim, and I think about him a lot. We even named our company after his saying “party on the data” (Party On Data, Inc.). To this day, I continue to unpack and internalize the lessons I absorbed while working with him more than a decade ago.

I learned it’s important to make time for the unexpected. It felt like nearly everyone I knew in my circle had talked to Jim at some point – even people from very different fields of study. I am not sure how he was able to be so generous with his time given his position and stature, but if someone reached out to him with an interesting hook and was passionate about it, he made time. And it wasn’t just a meet-and-greet – he truly listened. He would be engrossed in the conversation, listening intently as if you were the professor and he were the student. He made you feel special – as if you had some unique insight about a very important problem area.

Jim’s projects were proof that making time and having an open mind for the unexpected – to converse and collaborate with people beyond your direct connections – can lead to breakthroughs in other disciplines. He made significant contributions to the SkyServer project, which helped astronomers federate multiple terabytes of images to serve as a virtual observatory (a world-wide telescope). He applied a similar approach to mapping data with the Virtual Earth project (the precursor to Google Maps – minus the AJAX).

In today’s world, with so many distractions and communication channels (many of which are being inundated with spam), it has become commonplace to ignore cold inbound requests. However, I learned from Jim that it’s crucial to make time for surprises, and to give back. No other Turing Award winner responded to my emails and calls – only Jim did – and by doing so, he completely changed my life for the better. Jim instilled in me the confidence that I mattered in this world, if someone as important as him was willing to invest his precious time in me.

I learned from Jim that it’s important to tackle very good and crisp problems – and to work diligently on them (and to write everything down). Jim had a knack for identifying great problems. Comb through his website – it’s hard to find a dud in his resume or project list. I remember we were in talks with a major hospital about an ambitious project to improve the detection of diseases. The hospital group was willing to support this high profile project in any way we needed (thanks to Jim being a rock star), but Jim immediately knew we wouldn’t be able to develop crisp, tangible results within a year. He wanted more control, and craved a project with more short-term wins.

When Jim did identify a crisp problem to work on, he went all-in. His work ethic was second to none. We were once at a baseball game together, and I could tell from his demeanor that he was itching to get back to the office to continue our work. If I was working late in the office, he would work late too. He remained technical (writing code right next to me) and deep in the weeds despite his senior management role. And he was responsive – late-night emails would get replies by 5 AM (he liked waking up with the birds). He pushed me to work harder – not by asking for it, but by leading by example.

With any project, but especially database projects, there are so many low-level, unsexy problems (like data cleaning) that have to be addressed before you can “party on the data.” “99% perspiration and 1% inspiration,” he would always say, as if it were a constant, inevitable force of nature that we had to equip ourselves for. He prepared me for that, which taught me how to stay focused and work harder.

I learned that it’s important to study the key inflection points of previous products and projects – to know your history in order to make better decisions. Jim was a master storyteller, constantly reciting history. I still remember his story about how Sybase was outgunned in the database market, but its innovation with stored procedures gave it the differentiation it needed to fight the fight with DB2 and Oracle. And, by the way, he was very laudatory of key features coming from competitors. He would never dismiss them – he loved the innovation, no matter where it came from. He wanted the truth about how best to solve a particular problem.

He loved to teach his lessons too. I recall one time I asked him a technical question, and an important call came through to his desk phone. He immediately hung up the call and took me to the whiteboard to teach me what he knew about the topic in question. Who does that? You’d be lucky to meet with your thesis advisor or manager once a week for 30 minutes, but Jim was present for me like this almost every day.

Jim set the highest management bar imaginable for me. He showed me why I should optimize 100% for mentorship throughout my career – not company brand – and to do this every time.

I sometimes wish he could see me now. I feel like I wasn’t able to show him everything I could do back then, as I was still in the infancy of my career. I know better now where I excel (and where I don’t). At the time, I wanted to learn and do it all, like there was no tomorrow. He encouraged me to follow my passions – even if they were outside his comfort zone. Jim had no ego – he would loop in another mentor who knew more about a particular subject area. He gave me rope to learn, fail and rebuild. I tried to savor every minute I had with Jim, and am thankful that I did.

Despite his amazing technical accomplishments, I honestly do not remember many of the technical concepts that he had taught me. What I remember is how he made me feel. That’s what lives on and matters most. He gave me confidence, by just responding to me, and of course, working side by side with me. He rewarded my proactive outreach (which certainly encouraged me to send many more cold emails thereafter), and most importantly, taught me how to approach and solve big problems.

Jim truly inspires me, and I am forever grateful for what he did for me and my career. I sincerely hope that one day, I too, can have such a profound positive influence on so many people’s lives.

To being tenacious like Jim.


A Comparison of Open Source Search Engines

Updated: Sphinx’s setup wasn’t exactly “out of the box.” With its configuration fixed, Sphinx now searches the fastest, and its relevancy increased (charts updated below).

Motivation

Later this month we will be presenting a half-day tutorial on Open Search at SIGIR. It will focus on how to use open source software and cloud services to build and quickly prototype advanced search applications. Open Search isn’t just about building a Google-like search box on a free technology stack; it’s about encouraging the community to extend and embrace search technology to improve the relevance of any application.

For example, one non-search application of BOSS leveraged the Spelling service to spell-correct video comments before handing them off to a spam filter. The spelling-correction process normalizes popular words that spammers intentionally misspell to get around spam models that rely on term statistics, and can thus increase spam detection accuracy.

We have split up our upcoming talk into two sections:

  • Services: Open Search Web APIs (Yahoo! BOSS, Twitter, Bing, and Google AJAX Search), interesting mashup examples, ranking models and academic research that leverage or could benefit from such services.
  • Software: How to use popular open source packages for vertical indexing your own data.

While researching the Software section, I was quite surprised by the number of open source vertical search solutions I found.

And I was even more surprised by the lack of comparisons between these solutions. Many of these platforms advertise performance benchmarks, but they are done in isolation, use different data sets, and seem more focused on speed than on, say, relevance.

The best paper I could find comparing the performance and relevance of many open source search engines was Middleton+Baeza’07, but it is quite old now, and its source code and data sets were not made publicly available.

So, I developed a couple of fun, off-the-wall experiments to test some of the popular vertical indexing solutions (for building code examples – this is just a simple, quick evaluation, not the SIGIR material; read the disclaimer in the Conclusion section). Here’s a table of the platforms I selected to study, with some high-level feature breakdowns:

High-level feature comparison among the vertical search solutions I studied. The support rating and scale are based on information I collected from web sites and conversations (please feel free to comment). I tested each solution's latest stable release as of this week (Indri is TODO).

One key design decision I made was not to change any numerical tuning parameters. I really wanted to test “out of the box” performance to simulate the common developer scenario. Plus, it takes forever to optimize parameters fairly across multiple platforms and different data sets, especially for an over-the-weekend benchmark (see the disclaimer in the Conclusion section).

Also, I tried my best to write each experiment natively for each platform using the expected library routines or binary commands.

Twitter Experiment

For the first experiment, I wanted to see how well these platforms index Twitter data. Twitter is becoming very mainstream, and its real-time nature and brevity differ greatly from the traditional web content these search platforms are generally tailored for, so its data should make for some interesting experiments.

So I proceeded to crawl Twitter to generate a sample data set. After about a full day and night, I had downloaded ~1M tweets (~10/second).

But before indexing, I did some quick analysis of my acquired Twitter data set:

# of Tweets: 968,937

Indexable Text Size (user, name, text message): 92MB

Average Tweet Size: 12 words

Types of Tweets based on simple word filters:

Out of a 1M sample, what types of Tweets do we find? “Unique Users” means that ~600K distinct users authored all of the ~1M tweets in this sample.

Very interesting stats here – especially the high percentage of tweets that seem to be asking questions. Could Twitter (or an application) better serve this need?

Here’s a table comparing the indexing performance over this Twitter data set across the selected vertical search solutions:

Indexing 1M Twitter messages on a variety of open source search solutions, measuring time and space for each.

Lucene was the only solution that produced an index smaller than the input data size. It shaves an additional 5 megabytes if you run it in optimize mode, but at the cost of another ten seconds of indexing. sphinx and zettair index the fastest. Interestingly, I ran zettair in big-and-fast mode (which sucks up 300+ megabytes of RAM), but it ran slower by 3 seconds (maybe because of the nature of tweets).

Xapian ran 5x slower than sqlite (which stores the raw input data in addition to the index) and produced the largest index file sizes. The default index_text method in Xapian stores positional information, which blew the index up to 529 megabytes; one must use index_text_without_positions to make the size more reasonable. I checked my Xapian code against the examples and documentation to see if I was doing something wrong, but I couldn’t find any discrepancies.

I also included a column about development issues I encountered. zettair was by far the easiest to use (a simple command line) but required transforming the input data into a new format. I had some text issues with sqlite (which also needs to be recompiled with FTS3 enabled) and sphinx, given their strict input constraints. sphinx also requires a conf file, and it took some searching to find full examples of one. Lucene, zettair, and Xapian were the most forgiving when it came to accepting text inputs (zero errors).
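To make the positional-information point concrete, here is roughly what that indexing loop looks like in Xapian’s Python bindings – a hedged sketch of the standard TermGenerator pattern, not my exact benchmark code (the tweet fields and database path are placeholders):

```python
import xapian

# Open (or create) an on-disk index; the path is a placeholder.
db = xapian.WritableDatabase("tweets.db", xapian.DB_CREATE_OR_OPEN)
term_gen = xapian.TermGenerator()
term_gen.set_stemmer(xapian.Stem("english"))

def index_tweet(user, name, text, store_positions=False):
    doc = xapian.Document()
    doc.set_data(text)  # raw payload returned at search time
    term_gen.set_document(doc)
    content = " ".join((user, name, text))
    if store_positions:
        # The default index_text records term positions (enables phrase
        # and NEAR queries) -- this is what blew my index up to 529 MB.
        term_gen.index_text(content)
    else:
        # Drops positions for a much smaller index.
        term_gen.index_text_without_positions(content)
    db.add_document(doc)

index_tweet("zooie", "Vik Singh", "partying on the data")
```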

Measuring Relevancy: Medical Data Set

While this is a fun indexing-performance experiment for short text, it does not measure search performance or relevancy.

To measure relevancy, we need judgment data that tells us how relevant a document result is to a query. The best data set I could find that was publicly available for download (almost all of them require mailing in CDs) was from the TREC-9 Filtering track, which provides a collection of 196,403 medical journal references – totaling ~300MB of indexable text (titles, authors, abstracts, keywords) with an average of 215 tokens per record. More importantly, this data set provides judgment data for 63 query-like tasks in the form of <task, document, rating> triples, where the rating is 2 (very relevant), 1 (somewhat relevant), or 0 (not rated). An example task is “37 yr old man with sickle cell disease.” To turn this into a search benchmark, I treat these tasks as OR’ed queries. To measure relevancy, I compute the average DCG across the 63 queries for results in positions 1-10.
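For concreteness, here’s a tiny Python sketch of that metric – assuming the standard DCG formulation with a log2 rank discount; the exact discount variant is my assumption, not something specified above:

```python
import math

def dcg_at_k(ratings, k=10):
    """DCG over the top-k results; `ratings` holds the graded judgment
    (2 = very relevant, 1 = somewhat, 0 = not rated) per rank."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(ratings[:k], start=1))

def average_dcg(per_query_ratings, k=10):
    """Mean DCG across queries (e.g., the 63 OHSUMED tasks)."""
    return sum(dcg_at_k(r, k) for r in per_query_ratings) / len(per_query_ratings)

# Toy example with two queries' top results:
print(average_dcg([[2, 1, 0, 2], [1, 1, 2]]))
```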

Performance and relevancy marks on the TREC-9 OHSUMED data set across the selected vertical search solutions. Lucene is the smallest, most relevant, and fastest to search; Xapian is very close to Lucene on the search side but 3x slower on indexing and 4x bigger in index space; zettair is the fastest indexer.

With this larger data set (3x larger than the Twitter one), we see zettair’s indexing performance improve (which makes sense, as it’s designed more for larger corpora); zettair’s search speed would probably also be a bit faster if its search command-line utility didn’t print some unnecessary stats. For multi-searching in sphinx, I developed a Java client (with the hope of making it competitive with Lucene – the one to beat) which connects to the sphinx searchd server via a socket (that’s their API model in the examples). sphinx returned search results the fastest – ~3x faster than Lucene – and its indexing time was on par with zettair’s. Lucene obtained the highest relevance and the smallest index size; its indexing time could probably be improved by fiddling with its merge parameters, but I wanted to avoid numerical adjustments in this evaluation. Xapian’s search performance is very similar to Lucene’s, but with significant indexing costs (both time and space are more than 3x Lucene’s). sqlite has the worst relevance because it doesn’t sort by relevance, nor does it seem to provide a ranking function to ORDER BY.

Conclusion & Downloads

Based on these preliminary results and anecdotal information I’ve collected from the web and people in the field (with more emphasis on the latter), I would recommend Lucene for many vertical search indexing applications – especially if you need something that runs decently well out of the box (which is mainly what I’m evaluating here) and has community support. Keep in mind that Lucene is an IR library; use a wrapper platform like Solr with Nutch if you need all the search dressings (snippets, crawlers, servlets) on top of it.

Keep in mind that these experiments are still very early (done on a weekend budget) and can/should be improved greatly with bigger and better data sets, tuned implementations, and community support (I’d be the first to say they are far from perfect, which is why I open sourced my code below). It’s pretty hard to make a benchmark that everybody likes (especially in this space, where there haven’t really been many … and I’m starting to see why :)) – not necessarily because there are always winners/losers and biases in benchmarks, but because there are so many different types of data sets, platform APIs, and tuning parameters (at least databases support SQL!). This is just a start. I see this as a very evolutionary project that requires community support to get right. Take the results here for what they’re worth, and still run your own tuned benchmarks.

To encourage further search development and benchmarks, I’ve open sourced all the code here:

http://github.com/zooie/opensearch/tree/master

Happy to post any new and interesting results.


Yahoo Boss – Google App Engine Integrated

Updated: I see blogs doing evaluations of the Q&A engine. I have to admit, that wasn’t my focus here; the service is merely ~50 lines of code, written just to demonstrate the integration of BMF and GAE.

Updated: Direct link to the example Question-Answering Service

Today I finally plugged the Yahoo! Boss Mashup Framework into the Google App Engine environment. Google App Engine (GAE) provides a pretty sweet yet simple platform for executing Python applications on Google’s infrastructure. The Boss Mashup Framework (BMF) provides Python APIs for accessing Yahoo!’s Search APIs, as well as for remixing data à la SQL constructs. Running BMF on top of GAE is a seemingly natural progression, and quite arguably the easiest way to deploy Boss – so I spent today porting BMF to the GAE platform.

Here’s the full BMF-GAE integrated project source download.

There’s a README file included. Just unzip, put your appids in the config files, and you’re done – no setup or dependencies (easier than installing BMF standalone!). It’s a complete GAE project directory, including a directory called yos which holds all the ported BMF code. I also made a number of improvements to the BMF code (SQL ‘where’ support, stopwords, yql.db refactoring, util & templates in the yos namespace, yos.crawl.rest refactored & optimized, etc.).

The next natural thing to do is to develop a test application on top of this unified framework. In the original BMF package, there’s an examples directory. In particular, ex6.py was able to answer some ‘when’ style questions. I simply wrapped that code as a function and referenced it as a GAE handler in main.py.

Here’s the ‘when’ q&a source code as a webpage (less than 25 lines).

The algorithm is quite simple: use the question as the search query and fetch 50 results via the Boss API, count the dates that occur in the results’ abstracts, and return the most popular one.

For fun, following a similar pattern to the ‘when’ code, I developed another handler to answer ‘who’ or ‘what’ or ‘where’ style questions (finding the most popular capitalized phrase).
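To make both recipes concrete, here’s a minimal, self-contained Python sketch of the two heuristics. This is not the bundled BMF/GAE code – the Boss fetch is stubbed out, and the regexes are illustrative assumptions – but it captures the counting trick at the heart of each handler:

```python
import re
from collections import Counter

def fetch_abstracts(question, count=50):
    """Stand-in for the Boss API call: in the real handler, the question
    is sent as the search query and the top-50 result abstracts come back."""
    raise NotImplementedError("replace with a Boss API fetch")

# Illustrative patterns: 'Month Day, Year' or bare four-digit years,
# and runs of consecutive Title-Case words.
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b|\b\d{4}\b")
CAP_PHRASE_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

def most_popular(pattern, abstracts):
    """Count every match of `pattern` across the abstracts and return
    the most frequent one -- the core trick of both handlers."""
    counts = Counter(m.group(0) for text in abstracts
                     for m in pattern.finditer(text))
    return counts.most_common(1)[0][0] if counts else None

# 'when' -> most popular date; 'who/what/where' -> most popular
# capitalized phrase.
abstracts = [
    "The Battle of Hastings took place in 1066 near Hastings.",
    "In 1066, William the Conqueror defeated Harold at Hastings.",
    "1066 and all that: William the Conqueror won at Hastings.",
]
print(most_popular(DATE_RE, abstracts))        # -> 1066
print(most_popular(CAP_PHRASE_RE, abstracts))  # -> Hastings
```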

Here’s the complete example (just ~50 lines of code – bundled in project download):

Q&A Running Service Example

Keep in mind that this is just a quick proof of concept to hopefully showcase the power of BMF and the idea of Open Web Search.

If you’re interested in learning more about this Q&A system (or how to improve it), check out AskMSR – the original inspiration behind this example.

Also, a shoutout to Sam for his very popular Yuil example, which is powered by BMF + GAE. The project download linked above is aimed at making it easier for people to build these types of web services.


How Google is putting us back into the Stone Age

Yeah, I know – what a linkbait title. If that’s what it takes these days to get visitors and diggs, then so be it. Also, just to forewarn: as you read this, you might find that a better title for this post would have been “How Web 2.0 is putting us back into the Stone Age,” since many of these thoughts generalize to Web 2.0 companies as a whole. I used Google in the title mainly because they are the big daddy of the web world, the model many Web 2.0 companies strive to emulate, the one to beat. Plus, the title just looks and sounds cooler with ‘Google’ in it.

Here’s the main problem I have with web applications coming from companies like Google: about two years ago I bought a pretty good box – now fairly standard and cheap – 2 gigs of RAM, dual-core AMD64 3400+’s, a 250-gig hard drive, an nVidia 6600 GT PCI Express, etc. It’s a beast. However, because I don’t play games, its potential isn’t being utilized – not even close. Most of the applications I use are web-based, mainly because the web provides a medium that is cross-platform (all machines have a web browser) and synchronized (since the data is stored server-side, I can access it from anywhere – the library, a friend’s computer, my laptop), and it keeps my machine pretty light (no need to install anything, waste disk, or risk security issues). The web UI experience for the most part isn’t too bad either – in fact, I find that the browser’s restrictions force many UIs to be far simpler and easier to use. To me, the benefits mentioned above clearly compensate for any UI deficiencies. Unfortunately, this doesn’t mean that Web 2.0 is innovating the user’s experience. Visualizing data – search results, semantic networks, social networks, Excel data sheets – is still very primitive, and a lot can be done to improve this experience by taking advantage of the user’s hardware.

My machine, and most likely yours, is very powerful and underutilized. For instance, my graphics card has tons of cores. We live in an age where GPUs like mine can sort terabytes of data faster than a top-of-the-line Xeon-based workstation (refer to Jim Gray’s GPUTeraSort paper). For sorting, which is typically the bottleneck in database query plans and MapReduce jobs, it’s all about I/O – or in this case, how fast you can swap memory (for example, a 2-pass bitonic radix sort iteratively swaps the lows and the highs). Say you call memcpy in your C program on a $6,000 Xeon machine: the memory bandwidth is about 4 GB/s. Do the equivalent on a $200 graphics co-processor and you get about 50 GB/s. Holy smokes!

I know I’m getting off-topic here, but why is it so much faster on a GPU? Well, in CPU world, memory access can be quite slow. You have nearly random jumps in memory, which can result in expensive TLB/cache misses, page faults, etc., plus context switching for multi-processing – lots of overhead. Now compare this with a GPU, which streams memory almost directly to tons of cores. The cores on a GPU are fairly cheap, dumb processing units compared to the cores in a CPU, but the GPU uses hundreds of them, in parallel, to drastically speed up the overall processing. This, coupled with its specialized memory architecture, results in amazing performance bandwidth. Interestingly, since these cores are cheap, there’s a lot of room for improvement: at the current rate, GPU advancements are occurring 3-4x faster than Moore’s law for CPUs. Additionally, the graphical experience is near real-life quality, and current APIs let developers draw 3D triangles directly off the video card! This is some amazing hardware, folks. GPUs, and generally this whole notion of co-processing to optimize for operations that lag on CPUs (memory bandwidth, I/O), promise to make future computers faster than ever.
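If you want to eyeball the CPU side of this on your own machine, here’s a crude probe – a hedged NumPy sketch that measures an in-memory copy, nothing GPU-side (the figures quoted above come from the GPUTeraSort work, not from this script):

```python
import time
import numpy as np

def copy_bandwidth_gb_s(n_bytes=1 << 28, trials=5):
    """Time a large in-memory copy and report effective GB/s,
    counting one read plus one write of the buffer."""
    src = np.ones(n_bytes, dtype=np.uint8)
    dst = np.empty_like(src)
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        np.copyto(dst, src)
        best = min(best, time.perf_counter() - start)
    return (2 * n_bytes) / best / 1e9

print(f"~{copy_bandwidth_gb_s():.1f} GB/s effective copy bandwidth")
```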

OK, so the basic story here is that our computers are really powerful machines. The web world doesn’t take advantage of this, and considering how much time we spend there, it’s an unfortunate waste of computing potential. Because of this, I feel we are losing an appreciation for our computers’ capabilities. For example, when my friend first started using Gmail, he was non-stop clicking on the ‘Invite a friend’ drop-down. He couldn’t believe the page could change without a browser refresh. Although this is quite an extreme example, I’ve seen the same phenomenon with many users on other websites. IMHO, this is completely pathetic, especially considering how powerful client-end applications can be in comparison.

Again, I’m not against web-based applications. I love Gmail, Google Maps, Reader, etc. However, there are applications which I do not think should be web-based. An example of this is YouOS, an OS accessible through the web browser. I mean, there’s some potential here, but the way it’s currently implemented is very limiting and unnecessary.

To me, people are developing web services with the mindset ‘can it hurt?’, when I think a better mantra is ‘will it advance computing and communication?’. Here’s the big Web 2.0 problem: just because you can make something Web 2.0-ish doesn’t mean you should. I think of this along the lines of Turing completeness, the notion in computer science for determining whether a system can express any computation. Basically, as long as you can process an input, store state, and return an output (i.e., a potentially stateful function), you can do any computation. Web pages provide an input form, perform calculations server-side, and generate output pages – enough to do anything according to this paradigm, but with extreme limitations on visualization and performance (as with games). AJAX makes web views richer, but it is a terribly hacked-up programming model, and for some reason it compels developers to convert previously successful client-end applications into web-based services. Sometimes this makes sense from an end-user perspective, but it often results in dumbing down the user experience.

We have amazing hardware that’s not being leveraged in web-based services. Browsers provide an emulation of a real application. However, given the proliferation of AJAX Web 2.0 services, we’re starting to see applications appear only in the browser and not on the client. I think this current architecture view is unfortunate, because what I see in a browser is typically static content – something I could capture the essence of with a camera shot. In some sense, Web 2.0 is a surreal hack on what the real online experience should be.

I feel we really deserve truly rich applications that deliver ‘Minority Report’ style interfaces that utilize the client’s hardware. Movies predating the 1970s predicted so much more for our current state’s user experience level. It’s up to us, the end-consumers, to encourage innovation in this space. It’s up to us, the developers, to build killer applications that require tapping into a computer’s powerful hardware. The more we hype up Web 2.0 and dumbed-down webpage experiences, the more website-based services we get – and consequently, less innovation in hardware-driven UIs.

But there’s hope. I think there exists a fair compromise between client-end applications and server-side web services. The internet is getting faster, and the browser plus Flash are being fine-tuned to make better use of a computer’s resources. Soon, the internet will be well-suited for thin-client computing. A great example already exists today, and I’m sure many of you have used it: Google Earth. It’s a client-end application – taking advantage of the computer’s graphics and processing power to make you feel like you’re traveling in and out of space – while also being a server-side service, since it gathers updated geographical data from the web. The only problem is there’s no cross-platform, preexisting layer for building applications like this. How do we make such services without forcing the user through an interventionist, slow installation? How do we make them run across different platforms? Personally, I think Microsoft completely missed the boat here with .NET. If MS had recognized the web phenomenon early on, they could have built this layer into Vista to encourage developers to build these rich thin-client applications, while also promoting Vista. I have no reason to change my OS – this could have been my reason! Even as a cross-platform layer, better performance alone would still have been a reason to prefer it (providing some business case). Instead, they treated .NET as a Java-based replacement for MFC, thereby forcing developers to resort to building their cross-platform, no-installation-required services with AJAX and Flash.

Now, even if this layer existed – enabling developers to build and instantly deploy Google Earth style applications in a cross-platform manner – there would be security concerns. One could make the case that ActiveX attempted to do this, allowing developers to run arbitrary code on clients’ machines. Unfortunately, that led to numerous viruses. Security violations and spyware scare(d) all of us – so much so that we now do traditionally client-end functions through a dumbed-down web browser interface. But I think we’ve made some serious inroads in security since then. The fact that we even recognize security in current development makes us readily prepared to support such a platform. I am confident that the potential security issues can be tackled.

To make a final point, I think we all really need higher expectations on the user experience front. We need to develop killer applications that push the limits of our hardware – to promote innovation and progress. We’re currently at a standstill, in my opinion. This isn’t how the internet should be. This is not how I envisioned the future five years ago. We can do better. We can build richer applications. But to do this, we as consumers must demand it so that companies have a business case to pursue it, and we need developers to come up with innovative ways of visualizing the large amounts of data being generated – with the help of hardware – thereby delivering long-awaited killer applications for our idle computers. Let’s take our futuristic dreams and finally translate them into our present reality.


SDSS Skyserver Traffic

This past summer I worked at MSR alongside Dr. Jim Gray, analyzing the web and SQL logs of the SkyServer (the online worldwide telescope portal). We just published our findings, which you can access here (MSR) or here (updated).

It still needs some clean-up (spelling, grammar, flow) and additional sections to tie up some loose ends, but it’s definitely presentable. I would love to hear what you think about the results (besides how pretty the graphs look :).


Google Co-op — An Intro & Some Insider Hacks

http://www.google.com/coop

So what is it? It’s called Google Co-op, a platform which enables users to build their own vertical search engines and make money off the advertisements. It provides a clean, easy interface for simple site restrictions (like what Yahoo! Search Builder and Live Macros offer) plus a number of power user features for tweaking the search results. The user has control over the look and feel (to embed the search box on their own site), can rank results, and even (multi) tag sites to let viewers filter out results by category.

But talk is cheap. So let me show you some examples of what you can do with Co-op:

http://vik.singh.googlepages.com/techstuff

This is a technology-specific search engine which lets users refine results based on Google Topics (global labels which anyone can annotate with). Basically, I was lazy here: I didn’t feel like multi-tagging sites/domains individually, so instead I just collected a laundry list of popular technology site domains in a flat file and pasted it into Google Co-op’s Custom Search Engine control panel/sites page. In addition – something I think is really useful – Google Co-op allows users to bulk upload links from OPML files. So, to make my life easier when building this, I uploaded Scoble’s and Matt Cutts’s OPMLs. Tons of great links there (close to 1,000 total). Then I clicked on the ‘filter results to just the sites I listed’ option, which I recommend you use: if you muddle your results with normal Google web search’s, you typically won’t see your results popping up on the first page despite the higher-priority promise for hand-chosen sites.

To enable the filters you see on the results page (Reviews, Forums, Shopping, Blogs, etc.), I did an intersection between the background label of my search engine and the Google Topics labels. How do you do that? The XML context configuration exposes a <BackgroundLabels> tag, and any labels listed in the BackgroundLabels block will be AND’ed (how cool is that). So I added the label of my search engine (each search engine has a unique background label – it can be found bolded on the Advanced tab page) and a Google Topic label (News, Reviews, Stores, Shopping_Comparison, Blogs, Forums, etc.) to the BackgroundLabels XML block, and I made a separate XML context file for each Google Topic intersection. By doing this, I didn’t have to tag any of my results and was still able to provide search filters. Google Topics does most of the hard work and gives me search refinements for free!
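For the curious, here’s roughly what one of those intersection context files looks like – a sketch from memory of the Co-op XML format; the <BackgroundLabels> tag and the AND’ing behavior are real, but the surrounding element and attribute names here are assumptions (see my actual XML files linked at the bottom of this post for the real thing):

```xml
<!-- Hypothetical sketch of an intersection context: only sites carrying
     BOTH labels (my engine's unique background label AND the global
     Google Topics "reviews" label) show up in the results. -->
<CustomSearchEngine>
  <Title>Tech Stuff - Reviews</Title>
  <Context>
    <BackgroundLabels>
      <!-- my engine's unique background label (bolded on the Advanced tab) -->
      <Label name="_cse_abc123xyz" mode="FILTER"/>
      <!-- the Google Topic label to intersect with -->
      <Label name="reviews" mode="FILTER"/>
    </BackgroundLabels>
  </Context>
</CustomSearchEngine>
```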

But say you’re not lazy. Here’s an example of what you can do with multi-tagging and refinements.

http://vik.singh.googlepages.com/machinelearningsearch2

This one is more of a power-user example – notice the refinements onebox on the search results page, and the labels ending with “>>”. These labels redirect to another label hierarchy (a hack: I used the label-redirect XML option to link to other custom search engine contexts – basically, I’m nesting search engines here).

Now, say you want to get fancy with the search results presentation. Here’s a way to do it with Google’s Ajax Search API:

http://www.google.com/uds/samples/cse/index.html

Thanks to Mark Lucovsky and Matt Wytock for developing that great example.
For more information about how to use the Ajax Search API with Custom Search, please take a look at this informative post: http://googleajaxsearchapi.blogspot.com/2006/10/custom-search-engine-support.html

While writing this blog post, I realized it would take me forever to go over the number of tricks one can pull with Co-op. Instead, I’ll summarize some of the big selling-point features to encourage everyone to start hacking away. Also, to help jump-start power users, I’ve linked the XML files I used to make my featured search examples at the bottom of this post.

Key Feature Summary (in no particular order):

… and much, much more (especially for power users).

If you need a search engine for your site, and your content has been indexed by Google, then seriously consider using this rather than building your own index – or worse, using the crappy full-text functions available in relational databases.

Here are my XML files:

ml-context.xml

ml-pop-context.xml

ml-complx-context.xml

ml-source-context.xml

tech-stuff-context.xml

techreviews.xml

techforums.xml

techshopping.xml

techblogs.xml

technews.xml

tech-stuff-scoble-annotations.xml

tech-stuff-matcutts-annotations.xml

Happy Coop hacking!


SQL Text Mining

One of the projects Jim Gray and I worked on this summer was classifying the types of SQL queries users ask on the SkyServer site ( http://cas.sdss.org/dr5/en/ ). We were surprised that we could not find any existing research describing methods for breaking down SQL for categorization – especially considering the number of websites and database workloads that keep query logs. Below is a link to the PowerPoint presentation I gave at MSR Mountain View last week, which describes how we analyzed the SQL. Notable features include text-processing strategies, clustering algorithms, distance functions, and two example applications (bot detection and query recommendation). We plan to publish our algorithms and results in a technical report in the next month or so – but for now, enjoy the .ppt. As always, comments are more than welcome.

SQL Text Mining Presentation

Creative Commons License
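To give a flavor of the text-processing side without spoiling the slides, here’s an illustrative Python sketch – my own toy example, not the pipeline from our report: collapse each SQL statement into a template by stripping literals, then compare templates with a simple Jaccard distance that a clustering algorithm could use.

```python
import re

def normalize_sql(query):
    """Collapse a SQL statement into a template: lowercase it, replace
    string/numeric literals with placeholders, squeeze whitespace
    (illustrative, not the exact pipeline from the report)."""
    q = query.strip().lower()
    q = re.sub(r"'[^']*'", "'?'", q)        # string literals -> '?'
    q = re.sub(r"\b\d+(\.\d+)?\b", "?", q)  # numeric literals -> ?
    q = re.sub(r"\s+", " ", q)
    return q

def jaccard_distance(q1, q2):
    """A simple distance between two query templates:
    1 - |shared tokens| / |union of tokens|."""
    t1 = set(normalize_sql(q1).split())
    t2 = set(normalize_sql(q2).split())
    return 1.0 - len(t1 & t2) / len(t1 | t2)

a = "SELECT ra, dec FROM PhotoObj WHERE objID = 12345"
b = "SELECT ra, dec FROM PhotoObj WHERE objID = 99999"
print(normalize_sql(a))        # select ra, dec from photoobj where objid = ?
print(jaccard_distance(a, b))  # 0.0 -- identical once templated
```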
