How Google is putting us back into the Stone Age

Yeah, I know – what a linkbait title. If that’s what it takes these days to get visitors and diggs then so be it. Also, just to forewarn, as you read this you might find that a better title choice for this post would have been “How Web 2.0 is putting us back into the Stone Age” since many of these thoughts generalize to Web 2.0 companies as a whole. I used Google in the title mainly because they are the big daddy in the web world, the model many web 2.0 companies strive to be like, the one to beat. Plus, the title just looks and sounds cooler with ‘Google’ in it.

Here’s the main problem I have with web applications coming from companies like Google: About two years ago I bought a pretty good box – fairly standard and cheap these days – 2 GB of RAM, dual-core AMD-64 3400+’s, 250 GB hard drive, nVidia 6600 GT PCI Express, etc. It’s a beast. However, because I don’t play games, its potential isn’t being utilized – not even close. Most of the applications I use are web-based, mainly because the web provides a medium that is cross-platform (all machines have a web browser), synchronized (since the data is stored server side, I can access it from anywhere – the library, a friend’s computer, my laptop) and light on my machine (no need to install anything, waste disk space, or risk security issues). The web UI experience for the most part isn’t too bad either – in fact, I find that the browser’s restrictions force many UIs to be far simpler and easier to use. To me, the benefits mentioned above clearly compensate for any UI deficiencies. Unfortunately, this doesn’t mean that Web 2.0 is innovating on the user experience. Visualizing data – search results, semantic networks, social networks, Excel data sheets – is still very primitive, and a lot could be done to improve this experience by taking advantage of the user’s hardware.

My machine, and most likely yours, is very powerful and underutilized. For instance, my graphics card has tons of cores. We live in an age where GPUs like mine can sort terabytes of data faster than a top-of-the-line Xeon-based workstation (refer to Jim Gray’s GPUTeraSort paper). For sorting, which is typically the bottleneck in database query plans and MapReduce jobs, it’s all about I/O – or in this case, how fast you can move memory around (for example, a 2-pass bitonic radix sort iteratively swaps the lows and the highs). Say you call memcpy in your C program on a $6,000 Xeon machine: the memory bandwidth is about 4 GB/s. Do the equivalent on a $200 graphics co-processor and you get about 50 GB/s. Holy smokes! I know I’m getting off-topic here, but why is it so much faster on a GPU? Well, in the CPU world, memory access can be quite slow. You have almost random jumps in memory, which can result in expensive TLB/cache misses, page faults, etc. You also have context switching for multi-processing. Lots of overhead going on there. Now compare this with a GPU, which streams memory almost directly to tons of cores. The cores on a GPU are fairly cheap, dumb processing units compared to the cores found in a CPU, but the GPU uses hundreds of them in parallel to drastically speed up the overall processing. This, coupled with its specialized memory architecture, results in amazing bandwidth. Also, interestingly, since these cores are cheap (bad), there’s a lot of room for improvement: at the current rate, GPU advancements are occurring 3-4x faster than Moore’s law for CPUs. Additionally, the graphical experience is near real-life quality – current APIs let developers render 3D triangles directly off the video card! This is some amazing hardware, folks. GPUs, and more generally this whole notion of co-processing to optimize for operations that lag on CPUs (memory bandwidth, I/O), promise to make future computers faster than ever.
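To make the bitonic sorting idea a bit more concrete, here is a toy, CPU-only Python sketch of a bitonic sort (my own illustration, not the GPUTeraSort implementation). The thing to notice is that each pass of compare-and-swaps in the merge step touches independent pairs, which is exactly the kind of work a GPU can spread across hundreds of cores:

```python
def bitonic_sort(xs, ascending=True):
    """Sort a list whose length is a power of two using a bitonic network."""
    if len(xs) <= 1:
        return xs
    half = len(xs) // 2
    first = bitonic_sort(xs[:half], True)    # ascending half
    second = bitonic_sort(xs[half:], False)  # descending half -> bitonic sequence
    return bitonic_merge(first + second, ascending)

def bitonic_merge(xs, ascending):
    if len(xs) <= 1:
        return xs
    half = len(xs) // 2
    # Compare-and-swap lows against highs; every pair here is independent.
    for i in range(half):
        if (xs[i] > xs[i + half]) == ascending:
            xs[i], xs[i + half] = xs[i + half], xs[i]
    return bitonic_merge(xs[:half], ascending) + bitonic_merge(xs[half:], ascending)

print(bitonic_sort([7, 3, 9, 1, 6, 2, 8, 5]))  # [1, 2, 3, 5, 6, 7, 8, 9]
```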

OK, so the basic story here is that our computers are really powerful machines. The web world doesn’t take advantage of this, and considering how much time we spend there, it’s an unfortunate waste of computing potential. Because of this, I feel we are losing an appreciation for our computers’ capabilities. For example, when my friend first started using Gmail, he was non-stop clicking on the ‘Invite a friend’ drop-down. He couldn’t believe the page could change without a browser refresh. Although this is quite an extreme example, I’ve seen the same phenomenon with many users on other websites. IMHO, this is completely pathetic, especially considering how powerful client-end applications can be in comparison.

Again, I’m not against web-based applications. I love Gmail, Google Maps, Reader, etc. However, there are applications which I do not think should be web-based. An example is YouOS, an OS accessible through the web browser. I mean, there’s some potential here, but the way it’s currently implemented is very limiting and unnecessary.

To me, people are developing web services with the mindset ‘can it hurt?’, when I think a better mantra is ‘will it advance computing and communication?’. Here’s the big Web 2.0 problem: just because you can make something Web 2.0-ish doesn’t mean you should. I think of this along the lines of Turing completeness, a notion in computer science for determining whether a system can express any computation. Basically, as long as you can process an input, store state, and return an output (i.e. a potentially stateful function), you can do any computation. Web pages provide an input form, perform calculations server side, and generate output pages – enough to do anything according to this paradigm, but with extreme limitations on visualization and performance (as with games). AJAX makes web views richer, but it is not only a terribly hacked-up programming model; it also, for some reason, compels developers to convert previously successful client-end applications into web-based services. Sometimes this makes sense from an end-user perspective, but it usually results in a dumbed-down user experience.

We have amazing hardware that’s not being leveraged by web-based services. Browsers provide an emulation of a real application. However, given the proliferation of AJAX Web 2.0 services, we’re starting to see applications appear only in the browser and not on the client. I think this architecture is unfortunate, because what I see in a browser is typically static content – something I could capture the essence of with a camera shot. In some sense, Web 2.0 is a surreal hack on what the real online experience should be.

I feel we really deserve truly rich applications that deliver ‘Minority Report’ style interfaces and utilize the client’s hardware. Movies made before the 1970s predicted so much more for today’s user experience. It’s up to us, the end consumers, to encourage innovation in this space. It’s up to us, the developers, to build killer applications that require tapping into a computer’s powerful hardware. The more we hype up Web 2.0 and dumbed-down webpage experiences, the more website-based services we get – and consequently, less innovation in hardware-driven UIs.

But there’s hope. I think there exists a fair compromise between client-end applications and server-side web services. The internet is getting faster, and the browser and Flash are being fine-tuned to make better use of a computer’s resources. Soon, the internet will be well suited for thin-client computing. A great example already exists today, and I’m sure many of you have used it: Google Earth. It’s a client-end application – taking advantage of the computer’s graphics and processing power to make you feel like you’re flying in and out of space – while also being a server-side service, since it gathers updated geographical data from the web. The only problem is that there’s no cross-platform, preexisting layer for building applications like this. How do we make these services without forcing the user through a slow, intrusive installation? How do we make them run on different platforms? Personally, I think Microsoft completely missed the boat here with .NET. If MS had recognized the web phenomenon early on, they could have built this layer into Vista to encourage developers to build these rich thin-client applications, while also promoting Vista. I have no reason to change my OS – this could have been my reason! Even if the layer were cross-platform, better performance on Windows would still be a reason to prefer it (providing some business case). Instead, they treated .NET as a Java-based replacement for MFC, thereby forcing developers to resort to building their cross-platform, no-installation-required services with AJAX and Flash.

Now, even if this layer existed, enabling developers to build and instantly deploy Google Earth style applications in a cross-platform manner, there would be security concerns. One could make the case that ActiveX attempted to do this – allowing developers to run arbitrary code on clients’ machines. Unfortunately, that led to numerous viruses. Security violations and spyware scare(d) all of us – so much so that we now do traditionally client-end tasks through a dumbed-down web browser interface. But I think we’ve made serious inroads in security since then. The fact that we even recognize security in current development makes us far better prepared to support such a platform. I am confident that the potential security issues can be tackled.

To make a final point, I think we all need higher expectations on the user experience front. We need to develop killer applications that push the limits of our hardware – to promote innovation and progress. We’re currently at a standstill, in my opinion. This isn’t how the internet should be. This is not how I envisioned the future five years ago. We can do better. We can build richer applications. But to do this, we as consumers must demand it, so that companies have a business case to pursue it. We need developers to come up with innovative, hardware-assisted ways of visualizing the large amounts of data being generated – thereby delivering long-awaited killer applications for our idle computers. Let’s take our futuristic dreams and finally translate them into our present reality.

Google Co-op just got del.icio.us!

Update: Sorry, link is going up and down. Worth trying, but will try to find a more stable option when time cycles free up.

This past week I decided to cook up a service (link in bold near the middle of this post) that I feel will greatly assist users in developing advanced Google Custom Search Engines (CSEs). I read through the Co-op discussion posts, digg/blog comments, reviews, emails, etc. and learned that many of our users are fascinated by the refinements feature – in particular, building search engines that produce results like this:

‘linear regression’ on my Machine Learning Search Engine

… but unfortunately, many do not know how to do this, nor do they understand (or want to hack up) the XML. Additionally, I think it’s fair to say many users interested in building advanced CSEs have already done similar site tagging/bookmarking through services like del.icio.us. del.icio.us really is great. Here are a couple of reasons why people should (and do) use del.icio.us:

  • It’s simple and clean
  • You can multi-tag a site quickly (comma separated field; don’t have to keep reopening the bookmarklet like with Google’s)
  • You can create new tags on the fly (don’t choose the labels from a fixed drop-down like with Google’s)
  • The bookmarklet provides auto-complete tag suggestions; shows you the popular tags others have used for that current site
  • Can have bundles (two level tag hierarchies)
  • Can see who else has bookmarked the site (can also view their comments); builds a user community
  • Generates a public page serving all your bookmarks

Understandably, we received several requests to support del.icio.us bookmark importing. My part-time role with Google ended just last Friday, so, as a non-Googler, I decided to build this project. Initially, I was planning to write a simple service to convert del.icio.us bookmarks into CSE annotations – and that’s it – but as I learned more about del.icio.us, I realized there were several additional features I could develop that would make our users’ lives even easier. So instead of just generating the annotations, I decided to generate the CSE contexts as well.
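To give a flavor of the core conversion step, here is a minimal Python sketch of turning (url, tags) pairs – the kind of data you would pull out of a del.icio.us export – into CSE annotations. The bookmarks, label names, and exact wrapper schema are illustrative assumptions on my part; check the Co-op documentation for the precise annotation format your CSE expects:

```python
import xml.etree.ElementTree as ET

# Hypothetical bookmark data, as you might get from a del.icio.us export.
bookmarks = [
    ("http://example.com/svm-tutorial/", ["machine-learning", "svm"]),
    ("http://example.org/regression-notes/", ["machine-learning", "statistics"]),
]

annotations = ET.Element("Annotations")
for url, tags in bookmarks:
    # One <Annotation> per bookmarked URL, one <Label> per del.icio.us tag.
    annotation = ET.SubElement(annotations, "Annotation", about=url)
    for tag in tags:
        ET.SubElement(annotation, "Label", name=tag.lower())

print(ET.tostring(annotations, encoding="unicode"))
```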

Ok, enough talk, here’s the final product:
http://basundi.com:8000/login.html

If you don’t have a del.icio.us account, and just want to see how it works, then shoot me an email (check the bottom of the Bio page) and I’ll send you a dummy account to play with (can’t publicize it or else people might spam it or change the password).

Here’s a quick feature list:

  • Can build a full search engine (like the machine learning one above) in two steps, without having to edit any XML, and in less than two minutes
  • Auto-generates the CSE annotations XML from your del.icio.us bookmarks and tags
  • Provides an option to auto-generate CSE annotations just for del.icio.us bookmarks that have a particular tag
  • Provides an option to auto-calculate each annotation’s boost score (log-normalizes over the max # of ‘Others’ per bookmark – see the sketch after this list)
  • Provides an option to auto-expand links (appends a wildcard * to any links that point to a directory)
  • Auto-generates the CSE context XML
  • Auto-generates facet titles
  • Since there’s a four facet by five labels restriction (that’s the max that one can fit in the refinements display on the search results page), I provide two options for automatic facet/refinement generation:
    • The first uses a machine learning algorithm to find the four most frequent disjoint 5-item-sets (based on the # of del.icio.us tag co-occurrences; it then does query-expansion over the tag sets to determine good facet titles)
    • The other option returns the user’s most popular del.icio.us bundles and corresponding tags
    • Any refinements that do not make it into the top 4 facets are dumped into a fifth facet in order of popularity. If you don’t understand this, don’t worry – you don’t need to! The point is that all of this is automated for you (just use the default Cluster option). If you want control over which refinements/facets get displayed, then just choose Bundle.
  • Provides help documentation links at key steps
  • And best of all … You don’t need to understand the advanced options of Google CSE/Co-op to build an advanced CSE! This seriously does all the hard, tedious work for you!
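Since the boost-score option is a bit opaque, here is my reading of the log-normalization step as a tiny Python sketch. The exact scaling Co-op expects is an assumption here; the idea is simply that a bookmark saved by many ‘Others’ gets a higher boost, with diminishing returns:

```python
import math

def boost_scores(others_counts, max_boost=1.0):
    """others_counts[i] = how many other del.icio.us users saved bookmark i."""
    max_others = max(others_counts) or 1  # guard against an all-zero list
    return [round(max_boost * math.log1p(n) / math.log1p(max_others), 3)
            for n in others_counts]

print(boost_scores([2, 15, 150, 800]))  # popular bookmarks get boosts closer to 1.0
```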

In my opinion, there’s no question that this is the easiest way to make a fancy search engine. If I make any future examples I’m using this – I can simply use del.icio.us, sign in to this service, and voila, I have a search engine with facets and multi-label support.


Please note that this tool is not officially endorsed by nor affiliated with Google or Yahoo! It was just something I wanted to work on for fun that I think will benefit many users (including myself). Also, send your feedback/issues/bugs to me or post them on this blog.

SDSS Skyserver Traffic

This past summer I worked at MSR alongside Dr. Jim Gray on analyzing the SkyServer’s web and SQL logs (the SkyServer is the online worldwide telescope portal). We just published our findings, which you can access here (MSR) or here (updated).

Still needs some clean-up (spelling, grammar, flow) and additional sections to tie up some loose ends, but it’s definitely presentable. Would love to hear what you guys think about the results (besides how pretty the graphs look :).

Google Co-op — An Intro & Some Insider Hacks

http://www.google.com/coop

So what is it? It’s called Google Co-op, a platform that enables users to build their own vertical search engines and make money off the advertisements. It provides a clean, easy interface for simple site restrictions (like what Yahoo! Search Builder and Live Macros offer) plus a number of power-user features for tweaking the search results. The user has control over the look and feel (to embed the search box on their own site), can rank results, and can even (multi-)tag sites to let viewers filter results by category.

But talk is cheap. So let me show you some examples of what you can do with Co-op:

http://vik.singh.googlepages.com/techstuff

This is a technology-specific search engine that lets users refine results based on Google Topics (global labels which anyone can annotate with). Basically, I was lazy here. I didn’t feel like multi-tagging sites/domains individually, so instead I collected a laundry list of popular technology site domains in a flat file and pasted it into Google Co-op’s Custom Search Engine control panel/sites page. In addition – something I think is really useful – Google Co-op allows users to bulk upload links from OPML files. So, to make my life easier when building this, I uploaded Scoble’s and Matt Cutts’s OPMLs. Tons of great links there (close to 1000 total). Then I clicked on the ‘filter results to just the sites I listed’ option (which I recommend you use, since if you mix your results in with normal Google web search results you typically won’t see your sites popping up on the first page, despite the higher priority promised for hand-chosen sites). To enable the filters you see on the results page (Reviews, Forums, Shopping, Blogs, etc.), I did an intersection of the background label of my search engine with the Google Topics labels. How do you do that? The XML context configuration exposes a <BackgroundLabels> tag, and any labels listed in the BackgroundLabels block will be AND’ed (how cool is that). So I added the label of my search engine (each search engine has a unique background label – it can be found bolded on the Advanced tab page) and a Google Topic label (News, Reviews, Stores, Shopping_Comparison, Blogs, Forums, etc.) to the BackgroundLabels XML block. I made a separate XML context file for each Google Topic intersection. By doing this, I didn’t have to tag any of my results and was still able to provide search filters. Google Topics does most of the hard work and gives me search refinements for free!
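For anyone who wants to replicate the intersection trick, here is a rough Python sketch that stamps out one context file per Google Topic. The wrapper elements and the placeholder background label are simplified assumptions on my part – copy the exact surrounding structure from a context file exported from your own CSE control panel (or from the XML files linked at the bottom of this post):

```python
# Each generated file ANDs the engine's unique background label with one
# Google Topic label inside <BackgroundLabels>, mirroring the trick above.
CONTEXT_TEMPLATE = """<CustomSearchEngine>
  <Title>Tech Stuff: {topic}</Title>
  <Context>
    <BackgroundLabels>
      <Label name="{engine_label}" mode="FILTER"/>
      <Label name="{topic_label}" mode="FILTER"/>
    </BackgroundLabels>
  </Context>
</CustomSearchEngine>
"""

engine_label = "_cse_example123"  # hypothetical; use the bolded label from your Advanced tab
topics = [("Reviews", "reviews"), ("Forums", "forums"), ("Blogs", "blogs"), ("News", "news")]

for topic, topic_label in topics:
    with open("tech%s.xml" % topic.lower(), "w") as f:
        f.write(CONTEXT_TEMPLATE.format(topic=topic,
                                        engine_label=engine_label,
                                        topic_label=topic_label))
```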

But say you’re not lazy. Here’s an example of what you can do with multi-tagging and refinements.

http://vik.singh.googlepages.com/machinelearningsearch2

This one is more of a power-user example – notice the refinements onebox on the search results page, and the labels with “>>” at the end. These labels redirect to another label hierarchy (a hack: I used the label redirect XML option to link to other custom search engine contexts – basically, I’m nesting search engines here).

Now, say you want to get fancy with the search results presentation. Here’s a way to do it with Google’s Ajax Search API:

http://www.google.com/uds/samples/cse/index.html

Thanks to Mark Lucovsky and Matt Wytock for developing that great example.
For more information about how to use the Ajax Search API with Custom Search, please take a look at this informative post: http://googleajaxsearchapi.blogspot.com/2006/10/custom-search-engine-support.html

While writing this blog post, I realized it would take me forever to go over the number of tricks one can pull with Co-op. Instead, I’ll summarize some of the big selling point features to encourage everyone to start hacking away. Also, to help jump start power users, I’ve linked the XML files I used to make my featured search examples at the bottom of this post.

Key Feature Summary (in no particular order):

and much much more (especially for power users).

If you need a search engine for your site, and your content has been indexed by Google, then seriously consider using this rather than building your own index – or worse, using the crappy full-text functions available in relational databases.

Here are my XML files:

ml-context.xml

ml-pop-context.xml

ml-complx-context.xml

ml-source-context.xml

tech-stuff-context.xml

techreviews.xml

techforums.xml

techshopping.xml

techblogs.xml

technews.xml

tech-stuff-scoble-annotations.xml

tech-stuff-matcutts-annotations.xml

Happy Coop hacking!

SQL Text Mining

One of the projects Jim Gray and I worked on this summer was classifying the types of SQL queries users ask on the SkyServer site ( http://cas.sdss.org/dr5/en/ ). We were surprised that we could not find any existing research describing methods for breaking down SQL for categorization – especially considering the number of websites and database workloads that keep query logs. Below is a link to the PowerPoint presentation I gave at MSR Mountain View last week, which describes how we analyzed the SQL. Notable features include text processing strategies, clustering algorithms, distance functions, and two example applications (bot detection and query recommendation). We plan to publish our algorithms and results in a technical report in the next month or so – but for now, enjoy the .ppt. As always, comments are more than welcome.
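To give a taste of the text-processing side, here is a tiny Python sketch of one preprocessing idea from the talk: strip out literals so queries that differ only in constants collapse to the same template, then count templates. The sample queries are made up for illustration, and the real pipeline does quite a bit more (tokenization, distance functions, clustering):

```python
import re
from collections import Counter

def sql_template(query):
    """Collapse a SQL query to a rough template by removing literals."""
    q = query.lower()
    q = re.sub(r"'[^']*'", "'?'", q)         # string literals -> '?'
    q = re.sub(r"\b\d+(\.\d+)?\b", "?", q)   # numeric literals -> ?
    return re.sub(r"\s+", " ", q).strip()    # normalize whitespace

log = [
    "SELECT ra, dec FROM PhotoObj WHERE objID = 1237648",
    "SELECT ra, dec FROM PhotoObj WHERE objID = 9934821",
    "SELECT TOP 10 * FROM SpecObj WHERE z > 0.3",
]
for template, count in Counter(sql_template(q) for q in log).most_common():
    print(count, template)
```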

SQL Text Mining Presentation

Creative Commons License

Hierarchical Multi-Labeling Presentation Slides

Below is a small class talk I gave on the hierarchical multi-labeling classification framework I outlined in my previous ‘Future of Tagging’ posts. I ran a small experiment classifying tech news articles as Pro/Anti-Microsoft/Google (along with some other tags, like the tech category and whether the article is a blog or a publication, based on the text of the piece). The results are very promising – even with such a small corpus of training documents, the classifier performed very well. I do have some ideas on how to further improve accuracy, so when time cycles free up I’ll add those to the code and rerun it on a bigger and better (in terms of tag structure) training data set. By then I’m hoping the code will look less embarrassing for public release and the results will be more conclusive – but until that day, here are the presentation slides:

http://thumbstacks.com/play.html?show=b8f3f0c63f62631d9c5c77e33aba322a


The Future of Tagging – Part II

Note: Refers to the ideas described in the original post

An Algorithmic Update

Just wanted to let people know that I’ve changed my algorithms/framework for hierarchical multi-labeling classification quite a bit. One thing that really bugged me about my initial idea was the error correction scheme – i.e. sampling the tag network (a Bayesian/MRF hybrid) for closely related bitstrings. All the SAT/conditional probability table values in this network are generated from the number of times tags occur together in the training data, which makes my error correction scheme a popularity contest. But what about the feature values? We SHOULD take these values into account and try to reduce our new input down to a training data example with closely related feature values THAT also happens to have a similar tag bitstring (based on the prediction string output by the binary classifiers).

As for assuming there are k errors in the bitstring (call it b) we get back from the classifiers: before, we sampled new candidate bitstrings based on the bit pattern produced after randomly unsetting k bits in b. Instead, since many classifiers (like the support vector machine I’m using) can return a probability confidence along with the 0/1 output, my new algorithm chooses the k positions to unset not uniformly at random, but with a bias towards the bits with the smallest probabilities (since, according to the classifiers, those are the most likely to be erroneous).
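Here is a minimal Python sketch of that biased choice of bits (my own illustration; function and variable names are made up). Bits whose classifiers were least confident get picked most often:

```python
import random

def choose_bits_to_unset(confidences, k):
    """confidences[i] = classifier's probability for its 0/1 prediction on bit i."""
    weights = [1.0 - p for p in confidences]  # low confidence -> more likely to be flipped
    chosen = set()
    while len(chosen) < k:
        # sample one index at a time, proportional to its weight
        chosen.add(random.choices(range(len(confidences)), weights=weights, k=1)[0])
    return chosen

print(choose_bits_to_unset([0.95, 0.51, 0.88, 0.60, 0.99], k=2))  # usually {1, 3}
```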

I also added two tag normalization rules for determining how to choose labels:

  1. No more than one tag from each tree/hierarchy
  2. Each tag must be a leaf node in a tree

Why the rules? They provide some control over the placement and generality of the tags. The first ensures there’s some separation/disjointness among the tags. As for the second – I was afraid that mixing general and very specific tags together in a grouping could hurt my learner’s accuracy (since the tags/labels would not be on a par). By forcing tags to be leaf nodes in the trees, we sort of normalize the labels to the same weighted level of specificity.

Another note – when generating the tag binary classifiers, I originally proposed taking all the files/features that map to a label grouping containing that tag as the y=1 cases in the binary classifier’s training data, and all the files/features that map to a grouping not containing the tag as the y=0 cases. However, this way of splitting the data seems likely to produce many bad/unnecessary features, since (1) there can be a LOT of 0 cases and (2) 0-case files/examples can deal with ANYTHING, introducing completely irrelevant features into the tag’s binary classifier’s model. But the tag normalization rules above give us a way out of this dilemma: since we can only choose a single tag from each tree, we can use all the inputs/files/training examples that map to other leaf-node tags in the SAME tree as the zero cases. This selection of 0-case files scopes the context down to the one label hierarchy/tree that contains the tag we’re trying to learn.
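A small Python sketch of that ‘same tree’ selection rule, under my own made-up data layout (a list of (features, tag set) pairs and a map from tree name to its leaf tags):

```python
def binary_training_set(tag, tree, examples, tree_leaves):
    """Build the 0/1 cases for one tag's binary classifier.

    examples:    list of (feature_vector, set_of_tags) pairs
    tree_leaves: dict mapping a tree name to the set of its leaf tags
    """
    siblings = tree_leaves[tree] - {tag}
    positives = [(x, 1) for x, tags in examples if tag in tags]
    # Negatives come only from examples tagged with other leaves of the SAME tree.
    negatives = [(x, 0) for x, tags in examples if tag not in tags and tags & siblings]
    return positives + negatives

tree_leaves = {"Ideology": {"Democrat", "Marxist", "Anti-Microsoft", "Brazil", "USA"}}
examples = [([0.1, 0.9], {"Anti-Microsoft", "Blog"}), ([0.7, 0.2], {"Brazil", "Editorial"})]
print(binary_training_set("Anti-Microsoft", "Ideology", examples, tree_leaves))
```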

Anyway, I’ll try to post the pseudo code (and actual code) for my algorithms and some initial experimental results on this blog shortly. Additionally, expect a tutorial describing the steps/software I used to perform these tests.


 

Are professors too paper-happy?

[Update (4/26/2006): Motivation]

I've received several emails/comments (mostly from researchers and professors) regarding this post and realized some may have misconstrued my intentions for writing this article. I don't blame them. The title sounds pretty controversial and this post is quite long – compelling readers to skim and consequently miss some of the points I'm trying to make. As I mention in the second paragraph, 'I'm a huge fan of publications. The more the better.' All I'm trying to do here is make the case for why I think researchers should ALSO use more informal avenues of communication, such as blogs, for getting their ideas out for public review/commenting. These sources would not only increase reach/access, but serve as great supplementary material to their corresponding conference papers – NOT as replacements or alternatives to publications. I received a very insightful idea from Hari Balakrishnan – What if authors, in conjunction with their papers, release informal 5-6 page summaries of their projects/ideas for online public review/commenting? I would love to have resources like these available to me when dissecting the insight/knowledge encapsulated in formalized papers. I think simple ideas like these would significantly improve reachability/access of our research and even encourage more creativity/questioning.

[End Update]

On Digg:

http://digg.com/science/Are_professors_too_paper-happy_

I recently read through a professor's CV, which under the 'Publications' section stated "… published over 100 conference papers." I then proceeded to the next page of the resume to scan through some of his paper titles, only to see a brand new section titled 'References'. Wait, where'd his publications go? I went back to see if I had missed a page. Nope. Wow … why wouldn't he include his citations or mention notable pieces of work? I mean, saying you have 100+ publications gives me no value unless you're going to list them. Granted, that's an amazing accomplishment – writing a paper that advances (cs) research is not only time-consuming and hard work, but it requires a whole new level of creativity. Additionally, conferences are getting super competitive (unless he's publishing in these conferences) – I hear of many cases of famous, tenured professors/researchers getting their papers rejected. Now, multiplying this paper process by 100 represents a significant amount of research, so I'm willing to bet this guy is pretty bright. However, this one-line publication description gives me the impression that this professor aims for quantity in order to amplify his prestige. I also get the feeling that after publishing the 100th paper he reached one of those hard-set goals that gives one a sense of 'Mission Accomplished'. I guess I'm different – I'm all about quality. Personally, I'd be content with 2 publications in my life if one of 'em got me the Turing Award and the other a date with Heidi Klum – but that's just me 🙂

Interestingly (or oddly), this CV publication description also got me thinking about some of the things I hate about (cs) papers. But first, let me make it very clear – I am a huge fan of conference publications. The more the better. The peer review process of research papers is absolutely critical to advancing science. In this post, however, I would like to make the case for why I think all researchers should ALSO use more informal avenues of communication (such as blogs, wikipedia, etc.) for publicly broadcasting their ideas and results of their latest research.

So let's start this off with a laundry list of gripes I have about 'most' papers (with possible alternatives/solutions mixed in):

PDF/PS/Sealed/doc/dvi/tex formats

  • That load time for the viewer is killer – probably deters many from even clicking the link
  • Certain viewers render diagrams/text differently
  • The document looks formal, intimidating, elitist, old, plain, hardcore, not happy to see you
  • Also makes documents feel permanent, finalized, not amenable to changes
  • Provides no interface for commenting by readers – and in many cases I find reader critiques more interesting/useful than the actual paper
  • And why can't we see the feedback/comments from the conference?
  • Not pretty – It's a lot of material to read, so why not make the presentation look happier? I seriously think a nice stylesheet/web layout for paper content would significantly improve readability and portability

It's a lot of work

  • Not only does one need to come up with a fairly original idea, but research, discuss, and analyze its empirical/theoretical implications
  • Needs support/citations
  • Papers are quite formulaic – I can easily guess the page that has the nice/pretty graphs
  • This structure imposes pretty big research burdens on the authors
  • This is a GOOD thing for paper quality
  • But a terrible method for prototyping (similar to how UI dev's quickly prototype designs to cycle in user feedback)
  • Doesn't provide professors/researchers a forum to quickly get ideas out nor to filter in comments
  • Also prevents researchers from spewing their thoughts out since they wish to formalize everything in papers
  • Now there are other journals/magazines which are informal to let researchers publish their brainstorming/wacky idea articles
  • But there's still a time delay in getting those articles published
  • They still have writing standards and an approval process
  • And I don't read them (Where do I find good informal CS articles online? Is anyone linking to these? Who's authoring them?)
  • In blogs, authors can be comedic, speak freely ("he's full of it", "that technique is a load of crap" – opinions like these embedded in technical articles would make reading so much more enjoyable), and quickly get to the point without having to exhaust the boring implementation details

Access

  • Although papers normally include author emails, this means feedback is kept privately in the authors' inboxes, not viewable to the general public
  • Many papers/journals require subscriptions/fees

Prestige

  • The main issue here is professors/researchers want to publish in prestigious papers/journals
  • Rather than waste their time with things like weblogs, which are perceived to be inherently biased and non-factual, and where the audience may seem 'dumber'
  • It's in their best interest to focus on publications – get tenure, fame, approval from experts in the field

I want informality

  • But it sucks for us ('us' being the general audience who may wish to learn about these breakthroughs)
  • I love hearing professors ramble off their ideas freely
  • I want to see commentary from readers and myself ON these articles
  • And I want to see these ideas get posted and updated quickly
  • I want to see these experts explaining their ideas in simple terms (whenever possible, if it's too hardcore then it's probably not very good bloggable material) and describe the real world impacts/examples
  • But unfortunately NONE of my favorite professors/researchers publish their ideas on blogs

Popularity and Reach

  • There isn't much linkability happening for papers (besides from class/professor web sites)
  • You don't see conference papers getting linked on Slashdot/Digg
  • If millions of geeks don't even want to link/read this stuff, how and why would others? (refer below to the concluding section for why I think this is important)
  • Papers are formal, old-school, and designed to be read by expert audiences
  • They are written to impress, be exhaustive/detailed, when in reality most of us are lazy, want to read something entertaining, and get to the dang point
  • Wouldn't it be nice if professors/researchers expressed their ideas/research in a bloggish manner whenever possible?
  • At best, a blog post would be a great supplemental/introductory reading to the actual paper
  • Even the conference panel can check the blog comments to see if they missed anything
  • Some professors express their ideas informally with lecture notes and powerpoint presentations
  • But again, these formats don't let others annotate it with their comments
  • They are mostly pdf/ps/ppt/doc's (ah that load time)
  • And lecture notes usually exist for core concepts, not experimental/recent ideas

Wikipedia

  • Which brings me up another interesting idea
  • Where's THE place to learn about core, fundamental cs topics?
  • Or even slightly maturing cs algorithms/techniques?
  • I tend to follow this learning process:
    • Check Wikipedia
    • At best gives me a great introduction to a topic
    • If I need more, I use what I learned from Wikipedia/other sources to build queries for searching lecture notes/presentations
    • Typically the interesting details are hidden in a set of lecture notes/papers/books
    • And which of these sources should I use? – I have to read many versions coming from different professors to piece together a clear story
  • Wouldn't it be nice to have a one stop shop
  • Wikipedia!
  • What if Universities MOTIVATED professors/researchers to collaborate on wikipedia to publish in-depth articles regarding their areas of interest?
  • This would be huge
  • Wikipedia is incredibly popular/useful/successful, so let's use it as the place where professors around the world can unite their knowledge
  • Would be the best supplement (even replacement) for lecture notes
  • And for more experimental/controversial topics, researchers can use individual blogs

A slight digression and concluding remarks: The Future of Computer Science Research

Two things:

  1. I strongly believe the future of (computer) science relies on making our stuff more accessible to others – requiring us to tap formal AND informal avenues of communication
  2. We need to be more interdisciplinary!

Many people (especially prospective computer science students) ask me:

What's left to invent?

  • Most research to them (and me) sounds like 'optimizations' based work
  • Is there anything revolutionary left to do? Has it all been accomplished?
  • This is a hard question to answer convincingly, especially since if there was something revolutionary left to do, someone (hopefully me) would be on it
  • It's also difficult to foresee the impacts of ongoing research
  • Big ideas/inventions are evolutionary, piecing together decades of work.

They even say … "if only I could go back in time and invent the TV, PC, Internet, Web Search …"

  • As if back in the day we were living in the stone age and there were so many things left to do
  • There are even more things left to do today!

Our goal should not be to come up with an idea that surpasses the importance of say the invention of the PC

  • (I'm not even sure if this is possible, and at best is an ill comparison to make)
  • But rather to learn about the problems people face NOW and use our knowledge of science to solve them
  • Research should be entrepreneurial: Find a customer pain and provide a method to solve it
  • Not the other way around: Playing with super geeky stuff for fun and hoping later it might have applications (there are exceptions to this, since it encourages depth/exploration of a field which may lead to an accidental discovery of something huge, but I still think for the most part this is the wrong way to go about research)
  • Your audience is the most important element of research
  • And the beauty of our field – IT is a common theme in every research area!
  • We're the common denominator!
  • We can go about improving research in any subject

Now getting back to focus …

  • We consider the PC, Internet, etc. to be revolutionary because they are intuitive, fundamental platforms
  • Current research adds many layers of complexity to these ideas in order to solve more difficult problems
  • Making things more and more complex
  • What we need to be doing better is making our complex systems easier to use
  • We need better integration
  • And expand our research applications into other fields, rather than just reinforcing them back into computer science areas
  • We need to be interdisciplinary!
  • We have yet to even touch the surface when it comes to exploring the overlaps of fields

Some of the things we've been brewing since the 1970's could seriously REVOLUTIONIZE other industries

  • Imagine machine learning a database filled with all past medical histories/records/diagnostic reports
    • So when presented with symptoms/readings of a new patient, the system will tell you what that patient is suffering from with high probability based off analyzing millions of records
    • This would dramatically decrease the number of cases of death due to bad diagnosis and lower medical/insurance costs since it wouldn't be necessary to run a bunch of expensive scans/tests and surgeries to figure out what's wrong with a patient
  • A similar system could do the same thing over economics, disease, environmental, hazards datasets so that scientists and policy makers can ask questions like 'In the next five years what region of the world has the highest probability of a natural disaster … or a disease epidemic?, etc.'
    • This would be huge in shaping policy priorities, saving human lives, and preparing/preventing disasters like Katrina from ever escalating
  • Or what about mobile ad hoc networks to let villages in the poorest regions of the world wirelessly share an internet connection
  • Or sensor networks to do environmental sensing, health monitoring in hospitals, traffic control, pinpointing the location of a crime, etc.
  • And plenty, plenty more

The attractive element of computer science research is not necessarily the algorithms, nor the programming implementation

  • But rather the intelligence/knowledge it can provide to humans after partying on data
  • Information is what makes a killer application!
  • And every field has tons of interesting information
  • We need to INTERACT and learn more about the needs in other fields and shape our research accordingly
  • (Unfortunately, computer scientists have a somewhat bad reputation when it comes to any form of 'interaction')

So after hearing all this, the prospective science students complain:

  • "Now we have to learn about multiple fields – our research burden is so much bigger than the scientists' in the past"
  • This is a very silly argument for defending one's laziness to pursue (computer) science
  • I mean, I'd much rather hear one of them 'I don't want to get outsourced to India' excuses 🙂
  • Here's why:
    • We got it GOOD
    • I mean, geez, we got Google!
    • I can't even begin to imagine what Einstein, Gauss, Turing, Newton, etc. could have accomplished if they had the information research tools we got
    • We could have flying cars, cleaner environment, robots doing our homework
    • Heck, they could probably have found a way to achieve world peace
    • We take what we have for granted
  • Back in the day, there was no email, no search, no unified, indexed repository of publications
  • Scientists had to do amazing amounts of research just to find the right sources to learn from
  • Travel around the world to meet with leading researchers
  • Read entire books to find relevant information
  • You see, scientists back then had no choice but to be one-dimensional, and those who went deep into different fields either had no lives or were incredibly brilliant (most likely both)
  • But we can do all this work they had to do just to prepare for learning in a matter of seconds, from anywhere!
  • It's utter crap to think we can't learn about multiple fields of interest, in-depth
  • And our research tools will only get better and better exponentially faster
  • But I don't blame these students for their complaint
  • We're still trained to be one-dimensional
  • Universities have specific majors and graduate programs (although some of these programs are SLOWLY mixing together)
  • Thereby requiring many high school students to select a single major on the admissions application
  • Even though their high school (which focuses on breadth) probably did a terrible job introducing them to liberal arts and science
  • [Side Note: Actually, this is one of the things I never quite understood. How do students choose a major like EE/NuclearEng/MechEng on the admissions application, when their high school most likely doesn't offer any of those courses? It sounds like their decision came down to three factors: (1) the student is a geek, (2) parents' influence, (3) money. Two of those three are terrible reasons to enter these fields. Even worse, many universities separate engineering from liberal arts, making it almost impossible for a student who originally declared 'Economics' to switch to BioE after taking (and liking) a biotech class – despite the fact that their motivation to enter the major is so much better. You should enter a field you love, and deciding which students can enter engineering majors based on high school, which gives you almost zero exposure to these fields, makes no sense. And we wonder why we have a shortage of good engineers … well, one reason is probably that we don't let them into the university programs – 'them' being the ones who actually realized they liked this stuff when they got to college.]
  • Many job descriptions want students/applicants with skills pertaining to one area

But unfortunately the problems we now face require us to understand how things mix together

  • This is why it's very important that we increase the level of communication between fields to promote interdisciplinary research
  • Like a simple example: Databases and IR
    • Both have the same goal: To answer queries accurately regarding a set of data
    • Despite this, they are seen as two traditionally separate areas
    • Database/IR people normally work on different floors/buildings at companies/universities
    • One system is traditionally based more on logic while the other is more statistical/probabilistic
    • But exploiting their overlap could greatly improve information access (a la SQL Server 2005)
    • Notice this example refers to two subtopics under just computer science, not even two mainstream fields!
  • Just imagine the impacts of collaborating more with other fields of study

This is why I feel blogging, wikipedia, and informal avenues of communicating our thoughts/research/ideas is very important

  • So that people in other fields can read them
  • And for the laundry list of reasons above
  • Even if it's just a proof of concept technique with no empirical studies
  • An interested geek reader who came across the blog post might just end up coding it up for you
  • Could even inspire start-up companies

So the POINT: Do whatever you can to voice your ideas/research to the largest possible audience in the hopes of cross fertilizing research in many fields.

* So what's up with the parentheses around 'cs' and 'computer' in this post? Well, many of these points generalize to all (science) research publications 🙂

This work is licensed under a Creative Commons License

 

The Future of Tagging

Update (9/25): A Short Presentation On This (forgot to link)
Update (5/06): An Algorithmic Update
Update (3/29): Added a Motivation section

Motivation

After reviewing the comments/emails I’ve received so far, I realized my article could have made the motivation clearer. As I’ve mentioned in this post and the comments, I agree with everyone regarding basic tagging – it’s by far the simplest way for any user to organize their media/text so that others can search/relate their content. I don’t see this ever going away. However, in this article I decided to take a step back, look at the issues with the current tagging model, and examine an alternative method, namely hierarchical multi-labeling. Hierarchical multi-labeling solves many of the issues with basic tagging and should lend itself to better auto-tagging algorithms, since it tells us how tags are related to each other. I definitely agree this isn’t something we should expect the average user to perform – but I do think power users and content aggregators like Google News could benefit greatly from this tagging model.

One of my goals this semester is to let anyone (most likely a hardcore geek) pass a tag network and training data (both represented as files) to my web service and I generate for them a classifier page (with a search box that takes in a webpage link or a string of words). Click ‘Classify’ and it’ll return the best set of tags for that input based off the training data examples. Services can use this to classify their news articles, emails, bookmarks, etc. Notice the grunt work (which isn’t too bad) is done by a geek, but all users can benefit from such a system.

Anyway, hope you enjoy the read and please comment if possible.

On Digg:

http://digg.com/software/Why_current_tagging_sucks,_and_how_to_fix_it

Two ways we can improve upon tagging:

  1. Support label hierarchies/groupings
  2. Use multiple (but ‘normalized’, explained below) labels per object

Why Hierarchical Tags?

Many apps make use of a single-level hierarchy of tags

  • Ex’s: Flickr, Gmail, Google Base, Technorati, WordPress, Delicious, YouTube
  • Big buzz around tagging
  • Finally provides users a simple way to structure their unstructured content (typically text, photos, video, and music media)
  • Makes it easier to search for related items

But one level hierarchy has issues

  • Namespace – redundant/similar named tags treated differently – wasteful
  • Not normalized – tags are not equal, general and specific in the same level
  • Loses relationships among different tags (how is tag x and tag y causally related?)
  • One level hierarchy labels are really just keywords
  • Not much different than a concise textual description
  • Only provides structural support if multiple documents use same tags
  • In its current form, it’s an unstructured way of structuring unstructured content
  • But better than nothing

Very simple Ex: Gmail has a one level hierarchy of tags

  • I get an email regarding my CS294 class, so I tag it with ‘CS294’
  • However, I would also like to be able to search through all emails
    relevant to school, so I have to tag it with label ‘Coursework’
  • But ‘Coursework’ encompasses ‘CS294’
  • I have to redundantly add Coursework to each CS294 tagged email even though ‘CS294’ implies ‘Coursework’
  • I could label it ‘Coursework\CS294’, but that’s just one tag specific to
    CS294, I can’t separate out the ‘Coursework’ part for search
  • This slashing technique, popularized in many Delicious sites, provides hierarchy in name only
    • Fails to capture any practical benefits like searching or relating different hierarchies (see the toy sketch after this list)
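Here is that Gmail example as a toy Python sketch (the tag names and the parent map are made up): with a real parent relationship, tagging an email ‘CS294’ automatically makes it findable under ‘Coursework’, which slash-naming alone doesn’t give you:

```python
# Toy parent map: each tag points to its parent (None = top of the tree).
parents = {"CS294": "Coursework", "CS162": "Coursework", "Coursework": None}

def ancestors(tag):
    while tag is not None:
        yield tag
        tag = parents.get(tag)

emails = {"midterm details": {"CS294"}, "pay tuition": {"Admin"}}

def search(label):
    # An email matches if any of its tags has `label` somewhere up its chain.
    return [subject for subject, tags in emails.items()
            if any(label in ancestors(t) for t in tags)]

print(search("Coursework"))  # ['midterm details'] -- no redundant 'Coursework' tag needed
```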

Why Multi-Label?

Objects typically (and should) fall under several categories

Many cool AI applications/data representations motivate multiple labels:

  • Medical Diagnosis
    • Real life Dr. House without the attitude
    • Normally many causes/diseases associated to a set of health features
    • Help diagnosticians to narrow down on the most likely set of causes
  • Computer Vision (i.e. Flickr, Riya, espgame.org, peekaboom.com)
    • For ex. Espgame/Peekaboom collect many labels for images/pixels
    • Could use their data to train computer vision learners for auto-tagging
  • Email/Filesystems/Databases/Search (i.e. Gmail, WinFS, SQL, Google Base)
    • File/Directory concept outdated
    • ‘Files’ are really objects which have metadata and relational aspects
    • Multi-Labels present a great, simple way to structure the diverse unstructured content in a file
    • (side-note: hierarchical tags could be used to provide backwards compatibility with File/Directory)
  • News (i.e. Google News, Digg, CNN, NY Times, Slashdot, News.com)
    • Multiple (hierarchical) labels for each news piece
    • Like seeing these labels { News.Tech.Computers.Hardware; Ideology.Geeky.Anti-Microsoft; Source.Online.Blog; People.Bill-Gates }
      … tells me a ton about an article before even reading it
    • Plus I can now search/relate these tags to find similar news articles based off specific attributes

Let’s get a bit more technical

We organize labels into trees (gives us hierarchies)

Per object, we choose multiple labels if each label comes from a different tree (hence ‘normalized’, provides some degree of separation/independence of tags)
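In code, the ‘normalized’ constraint is just a one-line check. A minimal Python sketch (the tag-to-tree map is made up for illustration):

```python
def is_normalized(labels, tree_of):
    """True if no two chosen labels come from the same tree."""
    trees = [tree_of[label] for label in labels]
    return len(trees) == len(set(trees))

tree_of = {"News.Sports.Editorial": "News",
           "Ideology.Geeky.Anti-Microsoft": "Ideology",
           "Ideology.Liberal.Democrat": "Ideology"}

print(is_normalized(["News.Sports.Editorial", "Ideology.Geeky.Anti-Microsoft"], tree_of))  # True
print(is_normalized(["Ideology.Geeky.Anti-Microsoft", "Ideology.Liberal.Democrat"], tree_of))  # False
```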

So, what’s the point of adding all this complexity to tagging?

One of the nice benefits of tagging is it’s so simple

I agree: I’m not expecting mommy and daddy to do hierarchical multi-labeling

But content providers can do this to reap the benefits described above

AND, because it will help our artificial intelligence algorithms learn how to multi-tag objects automatically (mainly because we know how tags are related to each other)

A possible machine learning algorithm for hierarchical multi-labeling

Design

We’ll build this algorithm based off binary supervised classifiers because:

  • Well understood in theory & practice; simpler, best accuracy
  • Many multiclass classifiers actually use several pairwise (all-pairs, one-versus-all, etc) binary classifiers
  • Many algorithms to work with: Perceptron, Kernels (Support Vector Machines), Neural Nets, Decision Trees, etc.

Want to create a Bayesian network based off the tag trees (actually it’s more like a Markov random field since there are undirected edges between tree nodes which occur together in the training data, annotated with CPT/SAT-based representations describing the causalities)

Ex. of a Tag Network

News

  • Sports
    • Editorial

Ideology

  • Liberal
    • Democrat
    • Marxist
  • Geeky
    • Anti-Microsoft
  • Nationalism
    • Brazil
    • USA

Athletics

  • Outdoor
    • Baseball
    • Football
  • Indoor
    • Soccer

Source

  • Paper
    • Magazine
  • Online
    • Blog
      • Wikipedia

* Does not show (undirected) links between hierarchies (like
News.Sports to Ideology.Nationalism) since it’s hard to show in text

Ex. Training Data

x1=3.45, x2=2.10, x3=5.45, x4=0.20, x5=9.20

y =

  1. News.Sports.Editorial
  2. Ideology.Nationalism.Brazil
  3. Athletics.Indoor.Soccer
  4. Source.Online.Wikipedia

x1=1.25, x2=6.93, x3=3.11, x4=8.01, x5=0.20

y=

  1. News.Tech.Computers.Hardware
  2. Ideology.Geeky.Anti-Microsoft
  3. Source.Online.Blog

How do we fill in the CPT values for each Bayes node in the tag network?
We just count the tag groupings in the training data and use these counts to generate a distribution
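A minimal sketch of that counting step (the data layout is invented for illustration; real CPTs would also respect the tree structure rather than just raw pairs):

```python
from collections import Counter
from itertools import combinations

label_sets = [  # the y's from the two training examples above
    {"News.Sports.Editorial", "Ideology.Nationalism.Brazil",
     "Athletics.Indoor.Soccer", "Source.Online.Wikipedia"},
    {"News.Tech.Computers.Hardware", "Ideology.Geeky.Anti-Microsoft", "Source.Online.Blog"},
]

# Count how often each pair of tags appears together, then normalize.
pair_counts = Counter(frozenset(p)
                      for tags in label_sets
                      for p in combinations(sorted(tags), 2))
total = sum(pair_counts.values())
co_occurrence = {tuple(sorted(pair)): n / total for pair, n in pair_counts.items()}
print(co_occurrence)
```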

Learning

  • Create a Kernel (Support Vector) machine based binary classifier for each distinct tag
  • Train each binary classifier with the features from the training examples whose y contains the tag (set the classifier’s y = 1 for each of these feature sets) and with the features from examples whose y does not contain the tag (set the classifier’s y = 0 for each) – a short sketch follows this list

    • (side-note: also known as one-versus-all approach, most common multiclass method)
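Here is the one-versus-all setup as a small Python sketch; `train_binary` stands in for whatever binary learner you prefer (SVM, perceptron, …) and the data layout is my own illustration:

```python
def train_tag_classifiers(examples, train_binary):
    """examples: list of (feature_vector, set_of_tags); returns one classifier per tag."""
    all_tags = set().union(*(tags for _, tags in examples))
    classifiers = {}
    for tag in all_tags:
        X = [x for x, _ in examples]
        y = [1 if tag in tags else 0 for _, tags in examples]  # one-versus-all labels
        classifiers[tag] = train_binary(X, y)
    return classifiers
```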

Predicting

We run a new feature set through the set of binary classifiers, which each output a 0 or 1

Now we could just use this bitstring to immediately return a set of tags (the tags associated to 1 bits), and several existing Multi-Label approaches do this, but I think we can do better

The two main issues I have using this bitstring directly:

  1. The binary classifiers treat the tags independent of one another
  2. We don’t know which features correlate to which tags, and to what degree
    • Therefore we may be using irrelevant features (since we use them all) for training the binary classifiers, which hurts accuracy

These issues introduce errors in our bitstring b

However, we can use the tag relationships in our bayesian network to correct b

This problem lends nicely to an information theoretical approach

  • We received b over a noisy channel
  • Use what we know about tag relationships to reduce error (i.e. Hamming distance from the actual bitstring)

Reducing Error

There are several ways to go about error-correcting b; here’s the one I came up with (any feedback, especially here, would be great):

  • Use a Gibbs (MCMC) based sampling scheme to generate candidate bitstrings from b
  • Nice convergence properties
  • Not knowing which bits are wrong in b motivates randomized/sampling methods
  • For n times, randomly choose a bit and flip it proportional to its probability in its CPT, output the new bitstring
  • This scheme occasionally alternates between sampling new bitstrings based on previously ‘sample-generated’ bitstrings and sampling based on the original b (it could alternate after k iterations, where k is the median Hamming distance between the strings produced by the binary classifiers and the actual strings for a held-out training data set)

Now we wish to find ‘interesting’ bitstrings from our set of n

  • By ‘interesting’ I mean bitstrings with frequent itemsets (related)
  • Use Apriori data mining algorithm to find these bitstrings, call the returning set s
  • Then over the bitstrings in s, scan for bits that have the same assignments and for 1 bits

Run a diagnosis (or MAP) query over the tags assigned 1 in some (but not all) of the bitstrings, conditioned on the tags assigned the same value in every bitstring; this returns our desired tag assignments

Here’s an Ex.:

Say the bitstrings we get from the Apriori algorithm are:

A B C D E F G
0 1 0 1 0 1 0
0 1 1 1 0 1 0
0 1 0 1 0 0 1

I scan the bits and see that A is 0, B is 1, D is 1, and E is 0 in all the bitstrings

I also see C, F, G were assigned 1 in at least one bitstring

So I run this MAP query over my bayes network to find the assignment that maximizes:

Pr(C=?, F=?, G=? | A=0, B=1, D=1, E=0)

and return the tags B and D (since they were assigned 1 everywhere) plus whichever of C, F, G get assigned 1 by this query
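The scan itself is mechanical; here is a small Python sketch of it using the example bitstrings above (constant bits become evidence, bits that are 1 somewhere become the MAP query variables):

```python
bitstrings = ["0101010", "0111010", "0101001"]  # the three Apriori-frequent strings above
tags = list("ABCDEFG")

evidence, query_vars = {}, []
for i, tag in enumerate(tags):
    values = {b[i] for b in bitstrings}
    if len(values) == 1:                 # same assignment in every bitstring -> condition on it
        evidence[tag] = int(values.pop())
    elif "1" in values:                  # 1 in at least one bitstring -> ask the MAP query about it
        query_vars.append(tag)

print(evidence)    # {'A': 0, 'B': 1, 'D': 1, 'E': 0}
print(query_vars)  # ['C', 'F', 'G']
```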

Bayesian Inference and Diagnosis

Our previous step has left us with a very difficult problem to solve

  • Just doing plain ol’ bayesian inference queries is #P complete
  • But MAP queries are even harder since they need to infer the probabilities of all possible assignments

But luckily for me, I wanted to reduce a fun problem down to a MAP query

  • Let’s me relate this project to some very interesting class material 🙂

Prof. Satish Kumar in CS294 lectured on an exciting method for quickly computing inference/MAP queries

Utilizes several interesting tricks:

  • Precompiling the bayesian network into an efficient SAT-based representation
  • AND’ing each DNF clause with the query terms
  • And then counting the # of solutions to each clause using Karp’s FPRAS algorithm
  • Adding up the counters (each multiplied by some factor) solves your inference queries!

Can do DNF sampling (Las-Vegas or Atlantic-City style) under the same scheme for computing diagnosis queries

(references to learn more:
http://www.eecs.berkeley.edu/~tksk/classes/s06/handouts/lecture-06.pdf http://www.eecs.berkeley.edu/~tksk/PAPERS/thesis.pdf starting on pg. 179)

So, if we just precompile our tag network (before ever learning/predicting labels) then at runtime we can answer MAP queries in polynomial time w.r.t. the size of the SAT-based representation & exponential w.r.t the size of the largest communication link in our clique tree (an optimization that basically caches variables shared
between family nodes)

Pretty cool!

(A nice side-note property: the more variable assignments we condition on, the more we reduce the size of our SAT-based representation)

Benefits

Everything is Parallelizable!

  • Counting the tag occurrences in the training data to populate the CPTs in the Bayes network (split by record)
  • Generating the set of binary classifiers and running a new feature set over the binary classifiers (since they are independent)
  • Induced correction sampling of bitstrings
  • Apriori is parallelizable
  • Scanning the bitstrings returned from Apriori for commonalities (split by bitstring, for each tag just output 1/0, then reduce the sum and check it with the total # of bitstrings)
  • Even Satish’s inference/diagnosis algorithms (split by DNF clause)

Decouples tag relations and learning

  • Our tag network does not condition on specific feature values
  • This could be seen as a bad thing but learning the causalities of specific feature values to a label (or group of labels) sounds very difficult/messy & at best application limiting since it assumes we:
    1. Know all (or even a subset of) the possible values each of the features can take
    2. Have a learning method that maps each of these individual values to a tags distribution
  • However, binary classifiers implicitly find discrepancies in particular features in order to differentiate data inputs
  • In our model, the user just needs to have tag-specific binary classifiers
  • They don’t need to incorporate causality with other tags in their one-tag specific learner
    • Avoiding MultiClass learners (which aren’t nearly as well-understood nor as accurate as binary classifiers)
    • Additionally, our paradigm lets users plug-in their favorite binary classifier (many to choose from, as mentioned in Motivations)
  • Our model lets users simply construct tag dependency trees without having to explicitly show why those links exist

By making the problem more complex we might actually be improving learning accuracy

  • Tag networks, hierarchical labeling, & multi-labeling – the complexity added to the system – should improve learning
  • Lets us now exploit the overlap of tags
  • Multi-Labeling increases the chances that our learner can predict one, a subset, or all the tags correctly for a new input
    • May be a better payoff than the currently popular paradigm where you either get the label right or you don’t

 

This work is licensed under a Creative Commons License

 

Want to help the AI community? Play games!

Last week I heard Luis von Ahn’s talk on CMU’s Human Computation project, which utilizes human cycles to solve significant AI-complete problems like computer vision and NLP. How do they enlist the help of free human labor? By wrapping these problems in addictive, interactive, web-based games. Genius. For example, their ESP Game gives two remote players a series of identical images to label. Each time their labels match, their scores go up.

So, how does this help advance AI research? Well, despite our major breakthroughs in computer hardware and theory, we haven’t quite figured out a robust method for understanding the contents of a picture. This is a big problem because most pictures on the web don’t come with sufficient metadata and often have uninformative file names like ‘3.jpg’. However, unlike machines, we humans can easily categorize a set of images. These labels would significantly help search engines like Google improve their image rankings and even enable users to search for related images.

Other interesting remarks from the talk – in just the past year, humans spent over 9 billion hours playing Windows Solitaire. 9 billion! ASDFLKQOW!! To put this into perspective, it took 5 million hours to construct the Empire State Building and 20 million hours for the Panama Canal. That’s ridiculous. Luis estimates that if just 5,000 humans played ESP for two months, all the images in Google’s search engine could be accurately labeled.

So, the moral of this post – if you’re bored and got a few human cycles to spare, try playing these ‘productive’ games.