Category Archives: Google

How Google is putting us back into the Stone Age

Yeah, I know – what a linkbait title. If that’s what it takes these days to get visitors and diggs then so be it. Also, just to forewarn, as you read this you might find that a better title choice for this post would have been “How Web 2.0 is putting us back into the Stone Age” since many of these thoughts generalize to Web 2.0 companies as a whole. I used Google in the title mainly because they are the big daddy in the web world, the model many web 2.0 companies strive to be like, the one to beat. Plus, the title just looks and sounds cooler with ‘Google’ in it.

Here’s the main problem I have with web applications coming from companies like Google: About two years ago I bought a pretty good box – fairly standard and cheap these days – 2 GB of RAM, a dual-core AMD64 3400+, a 250 GB hard drive, an nVidia 6600 GT PCI Express card, etc. It’s a beast. However, because I don’t play games, its potential isn’t being utilized – not even close. Most of the applications I use are web-based, mainly because the web provides a medium that is cross-platform (every machine has a browser), synchronized (since the data is stored server-side, I can access it from anywhere – the library, a friend’s computer, my laptop), and light on my machine (nothing to install, no wasted disk, fewer security risks). The web UI experience for the most part isn’t too bad either – in fact, I find that the browser’s restrictions force many UIs to be far simpler and easier to use. To me, these benefits clearly compensate for any UI deficiencies. Unfortunately, that doesn’t mean Web 2.0 is innovating on the user’s experience. Visualizing data – search results, semantic networks, social networks, spreadsheets – is still very primitive, and a lot could be done to improve the experience by taking advantage of the user’s hardware.

My machine, and most likely yours, is very powerful and underutilized. For instance, my graphics card has tons of cores. We live in an age where GPUs like mine can sort terabytes of data faster than a top-of-the-line Xeon-based workstation (see Jim Gray’s GPUTeraSort paper). Sorting, which is typically the bottleneck in database query plans and MapReduce jobs, is all about I/O – or in this case, how fast you can move memory around (for example, a two-pass bitonic radix sort iteratively swaps the lows and the highs). Call memcpy in a C program on a $6,000 Xeon machine and you get about 4 GB/s of memory bandwidth. Do the equivalent on a $200 graphics co-processor and you get about 50 GB/s. Holy smokes! I know I’m getting off-topic here, but why is it so much faster on a GPU? Well, in the CPU world, memory access can be quite slow: near-random jumps through memory cause expensive TLB/cache misses and page faults, and multi-processing adds context-switching overhead. Compare this with a GPU, which streams memory almost directly to tons of cores. Each GPU core is a cheap, dumb processing unit compared to a CPU core, but the GPU uses hundreds of them in parallel to drastically speed up the overall processing. This, coupled with its specialized memory architecture, results in amazing bandwidth. Interestingly, because the individual cores are so simple, there’s a lot of room for improvement: at the current rate, GPU advancements are occurring 3-4x faster than Moore’s law for CPUs. On top of that, the graphical experience is approaching real-life quality, and current APIs let developers render 3D triangles directly off the video card. This is some amazing hardware, folks. GPUs – and more generally the whole notion of co-processing to optimize the operations that lag on CPUs (memory bandwidth, I/O) – promise to make future computers faster than ever.
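The bandwidth gap is easy to get a feel for on your own machine. Here’s a rough sketch in Python that times a large in-memory copy – interpreter overhead makes it a loose lower bound on what a native memcpy achieves, so treat the number as illustrative (the 4 GB/s and 50 GB/s figures above come from the paper, not this script):

```python
import time

def copy_bandwidth_gb_s(n_bytes=64 * 1024 * 1024, trials=3):
    """Rough estimate of sustained memory-copy bandwidth.

    Times a full read+write pass over a large buffer. This is a loose
    lower bound: a native memcpy benchmark (or a GPU bandwidth test)
    would avoid the interpreter's overhead entirely.
    """
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        dst = bytes(src)  # copies the whole buffer: n_bytes read + n_bytes written
        best = min(best, time.perf_counter() - start)
    assert len(dst) == n_bytes
    return (2 * n_bytes) / best / 1e9  # bytes moved / seconds, in GB/s

print(f"~{copy_bandwidth_gb_s():.1f} GB/s sustained copy bandwidth")
```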

OK, so the basic story here is that our computers are really powerful machines. The web world doesn’t take advantage of this, and considering how much time we spend there, it’s an unfortunate waste of computing potential. Because of this, I feel we are losing an appreciation for our computers’ capabilities. For example, when my friend first started using Gmail, he was non-stop clicking on the ‘Invite a friend’ drop-down – he couldn’t believe the page could change without a browser refresh. Although that’s an extreme example, I’ve seen the same phenomenon with many users on other websites. IMHO, this is completely pathetic, especially considering how powerful client-end applications can be in comparison.

Again, I’m not against web-based applications. I love Gmail, Google Maps, Reader, etc. However, there are applications which I do not think should be web-based. An example of this is YouOS, which is an OS accessible through the web-browser. I mean, there’s some potential here, but the way it’s currently implemented is very limiting and unnecessary.

To me, people are developing web services with the mindset ‘can it hurt?’, when I think a better mantra is ‘will it advance computing and communication?’. Here’s the big Web 2.0 problem: just because you can make something Web 2.0-ish doesn’t mean you should. I think of this along the lines of Turing completeness, the notion in computer science for determining whether a system can express any computation. Basically, as long as you can process an input, store state, and return an output (i.e., a potentially stateful function), you can do any computation. Web pages provide an input form, perform calculations server-side, and generate output pages – enough to do anything under this paradigm, but with extreme limitations on visualization and performance (as with games). AJAX makes web views richer, but not only is it a terribly hacked-up programming model, it also seems to compel developers to convert previously successful client-end applications into web-based services. Sometimes this makes sense from an end-user perspective, but it often results in dumbing down the user experience.
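The ‘potentially stateful function’ view is easy to make concrete. This toy Python sketch (the names are mine, purely illustrative) is all a web request/response cycle amounts to computationally – which is exactly why the limitation is in visualization and performance, not expressiveness:

```python
def make_service():
    """A web app reduced to its computational essence:
    take an input, update server-side state, return an output."""
    state = {"hits": 0}

    def handle(request: str) -> str:
        state["hits"] += 1
        return f"response to {request!r} (request #{state['hits']})"

    return handle

service = make_service()
print(service("GET /inbox"))  # request #1
print(service("GET /inbox"))  # request #2
```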

We have amazing hardware that’s not being leveraged by web-based services. The browser provides an emulation of a real application, and given the proliferation of AJAX Web 2.0 services, we’re starting to see applications appear only in the browser and not on the client. I think this is unfortunate, because what I see in a browser is typically static content – something whose essence I could capture with a camera shot. In some sense, Web 2.0 is a surreal hack on what the real online experience should be.

I feel we really deserve rich applications that deliver ‘Minority Report’-style interfaces by utilizing the client’s hardware. Movies made before the 1970s predicted far more for today’s user experience than we have actually delivered. It’s up to us, the end consumers, to encourage innovation in this space. It’s up to us, the developers, to build killer applications that require tapping into a computer’s powerful hardware. The more we hype up Web 2.0 and dumbed-down webpage experiences, the more website-based services we get – and consequently, the less innovation in hardware-driven UIs.

But there’s hope. I think there exists a fair compromise between client-end applications and server-side web services. The internet is getting faster, and the browser plus Flash are being fine-tuned to make better use of a computer’s resources. Soon, the internet will be well suited for thin-client computing. A great example already exists today, and I’m sure many of you have used it: Google Earth. It’s a client-end application – taking advantage of the computer’s graphics and processing power to make you feel like you’re flying in and out of space – while also being a server-side service, since it pulls updated geographical data from the web. The only problem is that there’s no cross-platform, preexisting layer for building applications like this. How do we make such services without forcing the user through an intrusive, slow installation? How do we make them run across different platforms? Personally, I think Microsoft completely missed the boat here with .NET. Had they recognized the web phenomenon early on, they could have built this layer into Vista to encourage developers to create rich thin-client applications, while also promoting Vista. I have no reason to change my OS – this could have been my reason! Even if such a layer weren’t cross-platform, better performance alone would have been a reason to prefer it (and a business case for building it). Instead, they treated .NET as a Java-like replacement for MFC, forcing developers to build their cross-platform, no-installation-required services with AJAX and Flash.

Now, even if this layer existed – letting developers build and instantly deploy Google Earth-style applications in a cross-platform manner – there would be security concerns. One could argue that ActiveX attempted exactly this, allowing developers to run arbitrary code on the client’s machine; unfortunately, it led to numerous viruses. Security violations and spyware scare(d) all of us – so much so that we now do traditionally client-end tasks through a dumbed-down web browser interface. But we’ve made serious inroads in security since then. The fact that we even recognize security in current development makes us far better prepared to support such a platform. I am confident that the potential security issues can be tackled.

To make a final point, I think we all need higher expectations on the user experience front. We need to develop killer applications that push the limits of our hardware – to promote innovation and progress. We’re currently at a standstill, in my opinion. This isn’t how the internet should be. This is not how I envisioned the future five years ago. We can do better. We can build richer applications. But for that to happen, we as consumers must demand it, so that companies have a business case to pursue it. We need developers to come up with innovative, hardware-driven ways of visualizing the large amounts of data being generated – thereby delivering long-awaited killer applications for our idle computers. Let’s take our futuristic dreams and finally translate them into our present reality.



Filed under Computer Science, Databases, Google, Hardware, UI, Web2.0

Google Co-op just got del.icio.us!

Update: Sorry, link is going up and down. Worth trying, but will try to find a more stable option when time cycles free up.

This past week I decided to cook up a service (link in bold near the middle of this post) that I feel will greatly assist users in developing advanced Google Custom Search Engines (CSE’s). I read through the Co-op discussion posts, digg/blog comments, reviews, emails, etc., and learned that many of our users are fascinated by the refinements feature – in particular, building search engines that produce results like this:

‘linear regression’ on my Machine Learning Search Engine

… but unfortunately, many do not know how to do this nor understand/want to hack up the XML. Additionally, I think it’s fair to say many users interested in building advanced CSE’s have already done similar site tagging/bookmarking through services like del.icio.us. del.icio.us really is great. Here are a couple of reasons why people should (and do) use del.icio.us:

  • It’s simple and clean
  • You can multi-tag a site quickly (comma separated field; don’t have to keep reopening the bookmarklet like with Google’s)
  • You can create new tags on the fly (don’t choose the labels from a fixed drop-down like with Google’s)
  • The bookmarklet provides auto-complete tag suggestions; shows you the popular tags others have used for that current site
  • Can have bundles (two level tag hierarchies)
  • Can see who else has bookmarked the site (can also view their comments); builds a user community
  • Generates a public page serving all your bookmarks

Understandably, we received several requests to support del.icio.us bookmark importing. My part-time role with Google just ended last Friday, so, as a non-Googler, I decided to build this project. Initially, I was planning to write a simple service to convert del.icio.us bookmarks into CSE annotations – and that’s it – but realized, as I learned more about del.icio.us, that there were several additional features I could develop that would make our users’ lives even easier. Instead of just generating the annotations, I decided to also generate the CSE contexts as well.

Ok, enough talk, here’s the final product:
http://basundi.com:8000/login.html

If you don’t have a del.icio.us account, and just want to see how it works, then shoot me an email (check the bottom of the Bio page) and I’ll send you a dummy account to play with (can’t publicize it or else people might spam it or change the password).

Here’s a quick feature list:

  • Can build a full search engine (like the machine learning one above) in two steps, without having to edit any XML, and in less than two minutes
  • Auto-generates the CSE annotations XML from your del.icio.us bookmarks and tags
  • Provides an option to auto-generate CSE annotations just for del.icio.us bookmarks that have a particular tag
  • Provides an option to auto-calculate each annotation’s boost score (log-normalizes over the max # of Others per bookmark)
  • Provides an option to auto-expand links (appends a wildcard * to any link that points to a directory)
  • Auto-generates the CSE context XML
  • Auto-generates facet titles
  • Since there’s a four facet by five labels restriction (that’s the max that one can fit in the refinements display on the search results page), I provide two options for automatic facet/refinement generation:
    • The first uses a machine learning algorithm to find the four most frequent disjoint 5-item-sets (based on the # of del.icio.us tag co-occurrences; it then does query-expansion over the tag sets to determine good facet titles)
    • The other option returns the user’s most popular del.icio.us bundles and corresponding tags
    • Any refinements that do not make it in the top 4 facets are dumped in a fifth facet in order of popularity. If you don’t understand this then don’t worry, you don’t need to! The point is all of this is automated for you (just use the default Cluster option). If you want control over which refinements/facets get displayed, then just choose Bundle.
  • Provides help documentation links at key steps
  • And best of all … You don’t need to understand the advanced options of Google CSE/Co-op to build an advanced CSE! This seriously does all the hard, tedious work for you!

In my opinion, there’s no question that this is the easiest way to make a fancy search engine. If I make any future examples I’m using this – I can simply use del.icio.us, sign-in to this service, and voila I have a search engine with facets and multi-label support.
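For the curious, the core of the conversion can be sketched in a few lines of Python. Everything here is illustrative – the function and field names are mine, not the service’s actual code – but it shows the log-normalized boost scoring and the wildcard expansion described in the feature list:

```python
import math
from xml.sax.saxutils import escape

def bookmarks_to_annotations(bookmarks, background_label):
    """Turn del.icio.us-style bookmarks into Google CSE <Annotations> XML.

    Each bookmark is a dict with 'url', 'tags', and 'others' (how many other
    users saved the same link). The boost score is log-normalized against
    the maximum 'others' count, and directory links get a trailing wildcard.
    """
    max_others = max(b["others"] for b in bookmarks) or 1
    lines = ["<Annotations>"]
    for b in bookmarks:
        url = b["url"]
        if url.endswith("/"):  # directory link: expand with a wildcard
            url += "*"
        score = math.log1p(b["others"]) / math.log1p(max_others)
        lines.append(f'  <Annotation about="{escape(url)}" score="{score:.2f}">')
        for tag in [background_label] + list(b["tags"]):
            lines.append(f'    <Label name="{escape(tag)}"/>')
        lines.append("  </Annotation>")
    lines.append("</Annotations>")
    return "\n".join(lines)

print(bookmarks_to_annotations(
    [{"url": "http://example.com/ml/", "tags": ["svm", "regression"], "others": 9}],
    background_label="_cse_mysearch",
))
```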


Please note that this tool is not officially endorsed by nor affiliated with Google or Yahoo! It was just something I wanted to work on for fun that I think will benefit many users (including myself). Also, send your feedback/issues/bugs to me or post them on this blog.


Filed under AI, Co-op, CS, CSE, Google, Machine Learning, Research, Tagging

Google Co-op — An Intro & Some Insider Hacks

http://www.google.com/coop

So what is it? It’s called Google Co-op, a platform which enables users to build their own vertical search engines and make money off the advertisements. It provides a clean, easy interface for simple site restrictions (like what Yahoo! Search Builder and Live Macros offer) plus a number of power user features for tweaking the search results. The user has control over the look and feel (to embed the search box on their own site), can rank results, and even (multi) tag sites to let viewers filter out results by category.

But talk is cheap. So let me show you some examples of what you can do with Co-op:

http://vik.singh.googlepages.com/techstuff

This is a technology-specific search engine which lets users refine results based off Google Topics (global labels that anyone can annotate with). Basically, I was lazy here. I didn’t feel like multi-tagging sites/domains individually, so instead I collected a laundry list of popular technology site domains in a flat file and pasted it into Google Co-op’s Custom Search Engine control panel/sites page. In addition – and I think this is really useful – Google Co-op allows users to bulk upload links from OPML files. So, to make my life easier, I uploaded Scoble’s and Matt Cutts’s OPMLs. Tons of great links there (close to 1000 total). Then I clicked on the ‘filter results to just the sites I listed’ option (which I recommend, since if you muddle your results with normal Google web search results, yours typically won’t show up on the first page despite the higher priority level promised for hand-chosen sites). To enable the filters you see on the results page (Reviews, Forums, Shopping, Blogs, etc.), I did an intersection of my search engine’s background label with the Google Topics labels. How do you do that? The XML context configuration exposes a <BackgroundLabels> tag, and any labels listed in the BackgroundLabels block will be AND’ed together (how cool is that?). So I added the label of my search engine (each search engine has a unique background label – it can be found bolded on the Advanced tab page) and a Google Topic label (News, Reviews, Stores, Shopping_Comparison, Blogs, Forums, etc.) in the BackgroundLabels XML block, with a separate XML context file for each Google Topic intersection. By doing this, I didn’t have to tag any of my results and was still able to provide search filters. Google Topics does most of the hard work and gives me search refinements for free!
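To make the trick concrete, here’s roughly the shape of one of those per-topic context files. The label names are placeholders and the exact attributes may differ slightly from what Co-op accepts, so treat this as a sketch rather than a copy-paste config:

```xml
<!-- Intersect the engine's unique background label with a Google Topics
     label: only sites carrying BOTH labels survive the AND. -->
<CustomSearchEngine>
  <Title>Tech Stuff: Reviews only</Title>
  <Context>
    <BackgroundLabels>
      <!-- this engine's own background label (bolded on the Advanced tab) -->
      <Label name="_cse_abcdefghijk" mode="FILTER"/>
      <!-- the global Google Topics label to intersect with -->
      <Label name="Reviews" mode="FILTER"/>
    </BackgroundLabels>
  </Context>
</CustomSearchEngine>
```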

But say you’re not lazy. Here’s an example of what you can do with multi-tagging and refinements.

http://vik.singh.googlepages.com/machinelearningsearch2

This one is more of a power user example – notice the refinements onebox on the search results page, and the labels with “>>” at the end. These labels redirect to another label hierarchy (a hack, I used the label redirect XML option to link to other custom search engine contexts – basically I’m nesting search engines here)

Now, say you want to get fancy with the search results presentation. Here’s a way to do it with Google’s Ajax Search API:

http://www.google.com/uds/samples/cse/index.html

Thanks to Mark Lucovsky and Matt Wytock for developing that great example.
For more information about how to use the Ajax Search API with Custom Search, please take a look at this informative post: http://googleajaxsearchapi.blogspot.com/2006/10/custom-search-engine-support.html

While writing this blog post, I realized it would take me forever to go over the number of tricks one can pull with Co-op. Instead, I’ll summarize some of the big selling point features to encourage everyone to start hacking away. Also, to help jump start power users, I’ve linked the XML files I used to make my featured search examples at the bottom of this post.

Key Feature Summary (in no particular order):

and much much more (especially for power users).

If you need a search engine for your site, and your content has been indexed by Google, then seriously consider using this rather than building your own index – or worse, using the crappy full-text functions available in relational databases.

Here are my XML files:

ml-context.xml

ml-pop-context.xml

ml-complx-context.xml

ml-source-context.xml

tech-stuff-context.xml

techreviews.xml

techforums.xml

techshopping.xml

techblogs.xml

technews.xml

tech-stuff-scoble-annotations.xml

tech-stuff-matcutts-annotations.xml

Happy Coop hacking!


Filed under Blog Stuff, Databases, Google, Tagging, Tutorial

The Future of Tagging

Update (9/25): A Short Presentation On This (forgot to link)
Update (5/06): An Algorithmic Update
Update (3/29): Added a Motivation section

Motivation

After reviewing the comments/emails I’ve received so far, I realized my article could make the motivation clearer. As I’ve mentioned in this post/comments, I agree with everyone regarding basic tagging – it’s by far the simplest way for any user to organize their media/text so that others can search/relate their content. I don’t see this ever going away. However, I decided to take a step back in this article and look at the issues with the current tagging model and examine an alternative method, namely hierarchical multi-labeling. Hierarchical multi-labeling solves many of the issues concerning basic tagging and should lend to better auto-tagging algorithms since it tells us how tags are related to each other. I definitely agree this isn’t something we should expect the average user to perform – but I do think power users and content aggregators like Google News could benefit greatly from this tagging model.

One of my goals this semester is to let anyone (most likely a hardcore geek) pass a tag network and training data (both represented as files) to my web service and I generate for them a classifier page (with a search box that takes in a webpage link or a string of words). Click ‘Classify’ and it’ll return the best set of tags for that input based off the training data examples. Services can use this to classify their news articles, emails, bookmarks, etc. Notice the grunt work (which isn’t too bad) is done by a geek, but all users can benefit from such a system.

Anyway, hope you enjoy the read and please comment if possible.

On Digg:

http://digg.com/software/Why_current_tagging_sucks,_and_how_to_fix_it

Two ways we can improve upon tagging:

  1. Support label hierarchies/groupings
  2. Use multiple (but ‘normalized’, explained below) labels per object

Why Hierarchical Tags?

Many apps make use of a single-level hierarchy of tags

  • Ex’s: Flickr, Gmail, Google Base, Technorati, WordPress, Delicious, YouTube
  • Big buzz around tagging
  • Finally provides users a simple way to structure their unstructured content (typically text, photos, video, and music media)
  • Makes it easier to search for related items

But a one-level hierarchy has issues

  • Namespace – redundant/similarly named tags are treated as different – wasteful
  • Not normalized – tags are not equal; general and specific tags sit at the same level
  • Loses relationships among different tags (how are tag x and tag y causally related?)
  • One-level hierarchy labels are really just keywords
  • Not much different from a concise textual description
  • Only provides structural support if multiple documents use the same tags
  • In its current form, it’s an unstructured way of structuring unstructured content
  • But better than nothing

Very simple Ex: Gmail has a one level hierarchy of tags

  • I get an email regarding my CS294 class, so I tag it with ‘CS294’
  • However, I would also like to be able to search through all emails
    relevant to school, so I have to tag it with label ‘Coursework’
  • But ‘Coursework’ encompasses ‘CS294’
  • I have to redundantly add Coursework to each CS294 tagged email even though ‘CS294’ implies ‘Coursework’
  • I could label it ‘Coursework\CS294’, but that’s just one tag specific to
    CS294, I can’t separate out the ‘Coursework’ part for search
  • This slashing technique, popularized in many Delicious sites, provides hierarchy in name only
    • Fails to capture any practical benefits like searching or relating different hierarchies
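The difference between hierarchy in name only and hierarchy the software understands is tiny in code. Here’s a minimal sketch (the names are my own, purely illustrative) that treats slashed tags as real paths, so a ‘Coursework’ search finds ‘Coursework/CS294’ items without redundant tagging:

```python
def matches(item_tags, query):
    """True if any of the item's hierarchical tags falls under `query`.

    Tags are paths like 'Coursework/CS294'; a query matches when it is a
    path prefix, so searching 'Coursework' finds everything beneath it.
    """
    query_parts = query.split("/")
    for tag in item_tags:
        if tag.split("/")[:len(query_parts)] == query_parts:
            return True
    return False

email_tags = ["Coursework/CS294"]
print(matches(email_tags, "Coursework"))         # True: implied by the path
print(matches(email_tags, "Coursework/CS294"))   # True: exact tag
print(matches(email_tags, "Coursework/CS101"))   # False: different course
```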

Why Multi-Label?

Objects typically (and should) fall under several categories

Many cool AI applications/data representations motivate multiple labels:

  • Medical Diagnosis
    • Real life Dr. House without the attitude
    • Normally many causes/diseases associated to a set of health features
    • Help diagnosticians to narrow down on the most likely set of causes
  • Computer Vision (i.e. Flickr, Riya, espgame.org, peekaboom.com)
    • For ex. Espgame/Peekaboom collect many labels for images/pixels
    • Could use their data to train computer vision learners for auto-tagging
  • Email/Filesystems/Databases/Search (i.e. Gmail, WinFS, SQL, Google Base)
    • File/Directory concept outdated
    • ‘Files’ are really objects which have metadata and relational aspects
    • Multi-Labels present a great, simple way to structure the diverse unstructured content in a file
    • (side-note: hierarchical tags could be used to provide backwards compatibility with File/Directory)
  • News (i.e. Google News, Digg, CNN, NY Times, Slashdot, News.com)
    • Multiple (hierarchical) labels for each news piece
    • Like seeing these labels { News.Tech.Computers.Hardware; Ideology.Geeky.Anti-Microsoft; Source.Online.Blog; People.Bill-Gates }
      … tells me a ton about an article before even reading it
    • Plus I can now search/relate these tags to find similar news articles based off specific attributes

Let’s get a bit more technical

We organize labels into trees (gives us hierarchies)

Per object, we choose multiple labels if each label comes from a different tree (hence ‘normalized’, provides some degree of separation/independence of tags)
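Using the dotted label paths from the examples above, the ‘normalized’ constraint is a one-line check – no two labels on the same object may share a root tree. A quick sketch:

```python
def is_normalized(labels):
    """A label set is 'normalized' when each label comes from a
    different tree, i.e. no two labels share the same root category."""
    roots = [label.split(".")[0] for label in labels]
    return len(roots) == len(set(roots))

print(is_normalized(["News.Sports.Editorial",
                     "Ideology.Nationalism.Brazil",
                     "Source.Online.Blog"]))   # True: three distinct trees
print(is_normalized(["News.Sports.Editorial",
                     "News.Tech.Computers"]))  # False: both from the News tree
```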

So, what’s the point of adding all this complexity to tagging?

One of the nice benefits of tagging is that it’s so simple

I agree: I’m not expecting mommy and daddy to do hierarchical multi-labeling

But content providers can do this to reap the benefits described above

AND, because it will help our artificial intelligence algorithms learn how to multi-tag objects automatically (mainly because we know how tags are related to each other)

A possible machine learning algorithm for hierarchical multi-labeling

Design

We’ll build this algorithm based off binary supervised classifiers because:

  • Well understood in theory & practice; simpler, best accuracy
  • Many multiclass classifiers actually use several pairwise (all-pairs, one-versus-all, etc) binary classifiers
  • Many algorithms to work with: Perceptron, Kernels (Support Vector Machines), Neural Nets, Decision Trees, etc.

Want to create a Bayesian network based off the tag trees (actually it’s more like a Markov random field since there are undirected edges between tree nodes which occur together in the training data, annotated with CPT/SAT-based representations describing the causalities)

Ex. of a Tag Network

News

  • Sports
    • Editorial

Ideology

  • Liberal
    • Democrat
    • Marxist
  • Geeky
    • Anti-Microsoft
  • Nationalism
    • Brazil
    • USA

Athletics

  • Outdoor
    • Baseball
    • Football
  • Indoor
    • Soccer

Source

  • Paper
    • Magazine
  • Online
    • Blog
      • Wikipedia

* Does not show (undirected) links between hierarchies (like
News.Sports to Ideology.Nationalism) since it’s hard to show in text

Ex. Training Data

x1=3.45, x2=2.10, x3=5.45, x4=0.20, x5=9.20

y =

  1. News.Sports.Editorial
  2. Ideology.Nationalism.Brazil
  3. Athletics.Indoor.Soccer
  4. Source.Online.Wikipedia

x1=1.25, x2=6.93, x3=3.11, x4=8.01, x5=0.20

y=

  1. News.Tech.Computers.Hardware
  2. Ideology.Geeky.Anti-Microsoft
  3. Source.Online.Blog

How do we fill in the CPT values for each Bayes node in the tag network?
We just count the tag groupings in the training data and use these counts to generate a distribution
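The counting step is simple enough to sketch directly. This illustrative Python (my own naming) tallies single-tag and pairwise co-occurrence counts; normalizing a tag’s pair counts is what would populate the CPT-style distributions:

```python
from collections import Counter
from itertools import combinations

def count_tag_groupings(training_labelsets):
    """Tally tag occurrences and pairwise co-occurrences in the training
    data; dividing a pair count by a tag's total count yields an estimated
    conditional probability for the corresponding CPT entry."""
    tag_counts, pair_counts = Counter(), Counter()
    for labels in training_labelsets:
        tag_counts.update(labels)
        for a, b in combinations(sorted(labels), 2):
            pair_counts[(a, b)] += 1
    return tag_counts, pair_counts

data = [
    {"News.Sports.Editorial", "Ideology.Nationalism.Brazil", "Athletics.Indoor.Soccer"},
    {"News.Tech.Computers.Hardware", "Ideology.Geeky.Anti-Microsoft", "Source.Online.Blog"},
    {"News.Sports.Editorial", "Athletics.Indoor.Soccer"},
]
tags, pairs = count_tag_groupings(data)
print(tags["News.Sports.Editorial"])                                # 2
print(pairs[("Athletics.Indoor.Soccer", "News.Sports.Editorial")])  # 2
```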

Learning

  • Create a Kernel (Support Vector) machine based binary classifier for each distinct tag
  • Train each binary classifier with the features from the training data whose y contains the tag (set classifier’s y = 1 for each of these feature sets)
    and with features that do not contain the tag (set classifier’s y = 0 for each)

    • (side-note: also known as one-versus-all approach, most common multiclass method)
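Here’s the one-versus-all setup in miniature. A trivial nearest-centroid rule stands in for the SVM (an assumption on my part, purely to keep the sketch self-contained); swapping in a real kernel machine doesn’t change the structure:

```python
def train_one_versus_all(examples):
    """One binary classifier per distinct tag: y=1 on examples carrying the
    tag, y=0 on the rest. A nearest-centroid rule stands in for the SVM."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    classifiers = {}
    for tag in sorted({t for _, tags in examples for t in tags}):
        pos = [x for x, tags in examples if tag in tags]
        neg = [x for x, tags in examples if tag not in tags]
        p, n = centroid(pos), centroid(neg)
        def predict(x, p=p, n=n):  # closer to the positive centroid -> 1
            d_pos = sum((a - b) ** 2 for a, b in zip(x, p))
            d_neg = sum((a - b) ** 2 for a, b in zip(x, n))
            return 1 if d_pos <= d_neg else 0
        classifiers[tag] = predict
    return classifiers

examples = [
    ([3.45, 2.10], {"News.Sports", "Athletics.Soccer"}),
    ([3.20, 2.35], {"News.Sports"}),
    ([1.25, 6.93], {"Ideology.Geeky"}),
]
clf = train_one_versus_all(examples)
print([clf[t]([3.4, 2.2]) for t in sorted(clf)])  # the bitstring over sorted tags
```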

Predicting

We run a new feature set through the set of binary classifiers, which each output a 0 or 1

Now we could just use this bitstring to immediately return a set of tags (the tags associated to 1 bits), and several existing Multi-Label approaches do this, but I think we can do better

The two main issues I have using this bitstring directly:

  1. The binary classifiers treat the tags independently of one another
  2. We don’t know which features correlate with which tags, and to what degree
    • Therefore we may be using irrelevant features (since we use them all) to train the binary classifiers, which hurts accuracy

These issues introduce errors in our bitstring b

However, we can use the tag relationships in our bayesian network to correct b

This problem lends nicely to an information theoretical approach

  • We received b over a noisy channel
  • Use what we know about tag relationships to reduce error (i.e. Hamming distance from the actual bitstring)

Reducing Error

There are several ways to go about error correcting b, here’s the one I came up with:
(any feedback esp. here would be great)

  • Use a Gibbs (MCMC) based sampling scheme to generate candidate bitstrings from b
  • Nice convergence properties
  • Not knowing which bits are wrong in b motivates randomized/sampling methods
  • For n times, randomly choose a bit and flip it proportional to its probability in its CPT, output the new bitstring
  • This scheme occasionally alternates from sampling new bitstrings based off previously ‘sample generated’ bitstrings and off the original b (could alternate after k iterations, where k is the median hamming distance score from the strings produced by the binary classifiers compared against the actual strings for a hidden training data set)
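A minimal sketch of the sampling loop. The per-bit flip probability here is a stand-in for the CPT lookup (a real version would condition each flip on the bit’s neighbors in the tag network), and the restart-after-k rule is simplified to a fixed period:

```python
import random

def sample_candidates(b, flip_prob, n=100, seed=0, restart_every=10):
    """Generate candidate corrections of bitstring b by randomly choosing a
    bit and flipping it with its (stand-in) probability, emitting one
    candidate per step and periodically restarting from the original b."""
    rng = random.Random(seed)
    current = list(b)
    candidates = []
    for step in range(1, n + 1):
        i = rng.randrange(len(current))
        if rng.random() < flip_prob[i]:
            current[i] ^= 1  # flip bit i
        candidates.append(tuple(current))
        if step % restart_every == 0:
            current = list(b)  # alternate back to the original observation
    return candidates

cands = sample_candidates([0, 1, 0, 1], flip_prob=[0.1, 0.05, 0.6, 0.05])
print(len(cands), cands[0])
```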

Now we wish to find ‘interesting’ bitstrings from our set of n

  • By ‘interesting’ I mean bitstrings with frequent itemsets (related)
  • Use Apriori data mining algorithm to find these bitstrings, call the returning set s
  • Then over the bitstrings in s, scan for bits that have the same assignments and for 1 bits

Run a diagnosis (or MAP) query over the tags assigned 1 conditioned on tags assigned the same value in every bitstring, which returns our desired tag assignments

Here’s an Ex.:

Say the bitstrings we get from the Apriori algorithm are:

A B C D E F G
0 1 0 1 0 1 0
0 1 1 1 0 1 0
0 1 0 1 0 0 1

I scan the bits and see that A is 0, B is 1, D is 1, and E is 0 in all the bitstrings

I also see C, F, G were assigned 1 in at least one bitstring

So I run this MAP query over my Bayes network to find the assignment that maximizes:

Pr(C=?, F=?, G=? | A=0, B=1, D=1, E=0)

and return the tags B and D (since they were assigned 1 everywhere) plus whichever ones this query assigns 1
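The scan step automates directly. Running this sketch on the three bitstrings above shows why the constant bits become the MAP query’s evidence and the sometimes-1 bits become its variables (note the scan also catches D, which is 1 in every row):

```python
def scan_bitstrings(bitstrings, names):
    """Bits constant across every frequent bitstring become MAP evidence;
    bits set to 1 in at least one bitstring become the query variables."""
    evidence, query = {}, []
    for i, name in enumerate(names):
        column = {bs[i] for bs in bitstrings}
        if len(column) == 1:
            evidence[name] = column.pop()
        elif 1 in column:
            query.append(name)
    return evidence, query

bitstrings = [
    (0, 1, 0, 1, 0, 1, 0),
    (0, 1, 1, 1, 0, 1, 0),
    (0, 1, 0, 1, 0, 0, 1),
]
evidence, query = scan_bitstrings(bitstrings, "ABCDEFG")
print(evidence)  # {'A': 0, 'B': 1, 'D': 1, 'E': 0}
print(query)     # ['C', 'F', 'G']
```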

Bayesian Inference and Diagnosis

Our previous step has left us with a very difficult problem to solve

  • Just doing plain ol’ Bayesian inference queries is #P-complete
  • But MAP queries are even harder, since they need to infer the probabilities of all possible assignments

But luckily for me, I wanted to reduce a fun problem down to a MAP query

  • Let’s me relate this project to some very interesting class material 🙂

Prof. Satish Kumar in CS294 lectured on an exciting method for quickly computing inference/MAP queries

Utilizes several interesting tricks:

  • Precompiling the bayesian network into an efficient SAT-based representation
  • AND’ing each DNF clause with the query terms
  • And then counting the # of solutions to each clause using Karp’s FPRAS algorithm
  • Adding up the counters (each multiplied by some factor) solves your inference queries!

Can do DNF sampling (Las-Vegas or Atlantic-City style) under the same scheme for computing diagnosis queries

(references to learn more:
http://www.eecs.berkeley.edu/~tksk/classes/s06/handouts/lecture-06.pdf http://www.eecs.berkeley.edu/~tksk/PAPERS/thesis.pdf starting on pg. 179)

So, if we just precompile our tag network (before ever learning/predicting labels), then at runtime we can answer MAP queries in time polynomial w.r.t. the size of the SAT-based representation and exponential w.r.t. the size of the largest communication link in our clique tree (an optimization that basically caches variables shared between family nodes)

Pretty cool!

(A nice side-note property: the more variable assignments we condition on, the more we reduce the size of our SAT-based representation)

Benefits

Everything is Parallelizable!

  • Counting the tag occurrences in the training data to populate the CPTs in the Bayes network (split by record)
  • Generating the set of binary classifiers and running a new feature set over the binary classifiers (since they are independent)
  • Induced correction sampling of bitstrings
  • Apriori is parallelizable
  • Scanning the bitstrings returned from Apriori for commonalities (split by bitstring, for each tag just output 1/0, then reduce the sum and check it with the total # of bitstrings)
  • Even Satish’s inference/diagnosis algorithms (split by DNF clause)
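The “split by bitstring, reduce the sum” step can be made concrete with an explicit map/reduce (a toy sketch over the earlier example; each mapper call is independent, so in a real MapReduce job the calls would run on separate workers):

```python
from functools import reduce

bitstrings = ["0101010", "0111010", "0101001"]

def mapper(bs):
    # Each bitstring emits its 0/1 value for every tag position.
    return [int(b) for b in bs]

def reducer(acc, row):
    # Element-wise sum of the per-tag counts.
    return [a + r for a, r in zip(acc, row)]

sums = reduce(reducer, map(mapper, bitstrings))
n = len(bitstrings)
# A tag whose sum equals the number of bitstrings was 1 everywhere;
# a sum of 0 means it was 0 everywhere; anything else varies.
always_one  = [i for i, s in enumerate(sums) if s == n]
always_zero = [i for i, s in enumerate(sums) if s == 0]

print(sums)         # [0, 3, 1, 3, 0, 2, 1]
print(always_one)   # [1, 3]  -> tags B and D
print(always_zero)  # [0, 4]  -> tags A and E
```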

Decouples tag relations and learning

  • Our tag network does not condition on specific feature values
  • This could be seen as a bad thing, but learning the causalities from specific feature values to a label (or group of labels) sounds very difficult/messy & at best application-limiting, since it assumes we:
    1. Know all (or even a subset of) the possible values each feature can take
    2. Have a learning method that maps each of these individual values to a tag distribution
  • However, binary classifiers implicitly find discrepancies in particular features in order to differentiate data inputs
  • In our model, the user just needs to have tag-specific binary classifiers
  • They don’t need to incorporate causality with other tags in their one-tag specific learner
    • Avoiding multiclass learners (which aren’t nearly as well-understood nor as accurate as binary classifiers)
    • Additionally, our paradigm lets users plug-in their favorite binary classifier (many to choose from, as mentioned in Motivations)
  • Our model lets users simply construct tag dependency trees without having to explicitly show why those links exist

By making the problem more complex we might actually be improving learning accuracy

  • Tag networks, hierarchical labeling, & multi-labeling – though they add complexity to the system – should improve learning
  • Lets us exploit the overlap of tags
  • Multi-labeling increases the chances that our learner can predict one, a subset, or all of the tags correctly for a new input
    • May be a better payoff than the currently popular paradigm where you either get the label right or you don’t

 

This work is licensed under a Creative Commons License

 


Filed under AI, Databases, Google, Machine Learning, Research, Tagging

Want to help the AI community? Play games!

Last week I heard Luis von Ahn‘s talk on CMU’s Human Computation project, which utilizes human cycles to solve significant AI-complete problems like computer vision and NLP. How do they enlist the help of free human labor? By wrapping these problems in addictive, interactive, web-based games. Genius. For example, their ESP Game gives two remote players a series of identical images to label; each time the players’ labels match, their scores go up.

So, how does this help advance AI research? Well, despite our major breakthroughs in computer hardware and theory, we haven’t quite figured out a robust method for understanding the contents of a picture. This is a big problem because most pictures on the web don’t come with sufficient metadata and often have nondescript file names like ‘3.jpg’. However, unlike machines, we humans can easily categorize a set of images. These labels would significantly help search engines like Google improve their image rankings and even enable users to search for related images.

Other interesting remarks from the talk – in just the past year, humans spent over 9 billion hours playing Windows Solitaire. 9 billion! ASDFLKQOW!! To put this into perspective, it took 5 million hours to construct the Empire State Building and 20 million hours for the Panama Canal. That’s ridiculous. Luis estimates that if just 5,000 humans played ESP for two months, all the images in Google’s search engine could be accurately labeled.

So, the moral of this post: if you’re bored and have a few human cycles to spare, try playing these ‘productive’ games.


Filed under AI, Games, Google, Research