Update (9/25): A Short Presentation On This (forgot to link)
Update (5/06): An Algorithmic Update
Update (3/29): Added a Motivation section
Motivation
After reviewing the comments/emails I’ve received so far, I realized my article could make the motivation clearer. As I’ve mentioned in this post/comments, I agree with everyone regarding basic tagging – it’s by far the simplest way for any user to organize their media/text so that others can search/relate their content. I don’t see this ever going away. However, I decided to take a step back in this article and look at the issues with the current tagging model and examine an alternative method, namely hierarchical multi-labeling. Hierarchical multi-labeling solves many of the issues concerning basic tagging and should lend itself to better auto-tagging algorithms, since it tells us how tags are related to each other. I definitely agree this isn’t something we should expect the average user to perform – but I do think power users and content aggregators like Google News could benefit greatly from this tagging model.
One of my goals this semester is to let anyone (most likely a hardcore geek) pass a tag network and training data (both represented as files) to my web service and I generate for them a classifier page (with a search box that takes in a webpage link or a string of words). Click ‘Classify’ and it’ll return the best set of tags for that input based off the training data examples. Services can use this to classify their news articles, emails, bookmarks, etc. Notice the grunt work (which isn’t too bad) is done by a geek, but all users can benefit from such a system.
Anyway, hope you enjoy the read and please comment if possible.
On Digg:
http://digg.com/software/Why_current_tagging_sucks,_and_how_to_fix_it
Two ways we can improve upon tagging:
- Support label hierarchies/groupings
- Use multiple (but ‘normalized’, explained below) labels per object
Why Hierarchical Tags?
Many apps make use of a single-level hierarchy of tags
- Examples: Flickr, Gmail, Google Base, Technorati, WordPress, Delicious, YouTube
- Big buzz around tagging
- Finally provides users a simple way to structure their unstructured content (typically text, photos, video, and music media)
- Makes it easier to search for related items
But a one-level hierarchy has issues
- Namespace – redundant/similarly named tags are treated differently – wasteful
- Not normalized – tags are not equal; general and specific tags sit at the same level
- Loses relationships among different tags (how are tags x and y causally related?)
- One level hierarchy labels are really just keywords
- Not much different than a concise textual description
- Only provides structural support if multiple documents use same tags
- In its current form, it’s an unstructured way of structuring unstructured content
- But better than nothing
Very simple Ex: Gmail has a one-level hierarchy of tags
- I get an email regarding my CS294 class, so I tag it with ‘CS294’
- However, I would also like to be able to search through all emails relevant to school, so I have to tag it with the label ‘Coursework’
- But ‘Coursework’ encompasses ‘CS294’
- I have to redundantly add ‘Coursework’ to each CS294-tagged email even though ‘CS294’ implies ‘Coursework’
- I could label it ‘Coursework\CS294’, but that’s just one tag specific to CS294; I can’t separate out the ‘Coursework’ part for search
- This slashing technique, popularized on many Delicious sites, provides hierarchy in name only
- Fails to capture any practical benefits like searching or relating different hierarchies
Why Multi-Label?
Objects typically do (and should) fall under several categories
Many cool AI applications/data representations motivate multiple labels:
- Medical Diagnosis
- Real life Dr. House without the attitude
- Normally many causes/diseases are associated with a set of health features
- Helps diagnosticians narrow down the most likely set of causes
- Computer Vision (i.e. Flickr, Riya, espgame.org, peekaboom.com)
- For ex. Espgame/Peekaboom collect many labels for images/pixels
- Could use their data to train computer vision learners for auto-tagging
- Email/Filesystems/Databases/Search (i.e. Gmail, WinFS, SQL, Google Base)
- File/Directory concept outdated
- ‘Files’ are really objects which have metadata and relational aspects
- Multi-Labels present a great, simple way to structure the diverse unstructured content in a file
- (side-note: hierarchical tags could be used to provide backwards compatibility with File/Directory)
- News (i.e. Google News, Digg, CNN, NY Times, Slashdot, News.com)
- Multiple (hierarchical) labels for each news piece
- Seeing labels like { News.Tech.Computers.Hardware; Ideology.Geeky.Anti-Microsoft; Source.Online.Blog; People.Bill-Gates } tells me a ton about an article before even reading it
- Plus I can now search/relate these tags to find similar news articles based off specific attributes
Let’s get a bit more technical
We organize labels into trees (gives us hierarchies)
Per object, we choose multiple labels if each label comes from a different tree (hence ‘normalized’, provides some degree of separation/independence of tags)
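To make that concrete, here’s a minimal sketch (in Python, with made-up names – nothing here comes from an existing library) of what a set of tag trees and a ‘normalized’ multi-label assignment might look like:

```python
# A minimal sketch of tag trees and 'normalized' multi-labels.
# Tag names and helpers are illustrative, not from any existing library.

# Each tree is a nested dict: tag -> children.
TAG_TREES = {
    "News":      {"Sports": {"Editorial": {}}},
    "Ideology":  {"Nationalism": {"Brazil": {}, "USA": {}}},
    "Athletics": {"Indoor": {"Soccer": {}}},
    "Source":    {"Online": {"Blog": {}, "Wikipedia": {}}},
}

def tree_of(tag_path):
    """Return the root tree of a dotted tag path like 'News.Sports.Editorial'."""
    return tag_path.split(".")[0]

def is_normalized(tag_paths):
    """'Normalized' multi-label: every chosen tag comes from a different tree."""
    roots = [tree_of(t) for t in tag_paths]
    return len(roots) == len(set(roots))

labels = ["News.Sports.Editorial", "Ideology.Nationalism.Brazil", "Source.Online.Blog"]
print(is_normalized(labels))  # True: one tag per tree
```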
So, what’s the point of adding all this complexity to tagging?
One of the nice benefits of tagging is that it’s so simple
I agree: I’m not expecting mommy and daddy to do hierarchical multi-labeling
But content providers can do this to reap the benefits described above
AND, because it will help our artificial intelligence algorithms learn how to multi-tag objects automatically (mainly because we know how tags are related to each other)
A possible machine learning algorithm for hierarchical multi-labeling
Design
We’ll build this algorithm based off binary supervised classifiers because:
- Well understood in theory & practice; simpler, best accuracy
- Many multiclass classifiers actually use several pairwise (all-pairs, one-versus-all, etc) binary classifiers
- Many algorithms to work with: Perceptron, Kernels (Support Vector Machines), Neural Nets, Decision Trees, etc.
We want to create a Bayesian network based off the tag trees (actually it’s more like a Markov random field, since there are undirected edges between tree nodes that occur together in the training data, annotated with CPT/SAT-based representations describing the causalities)
Ex. of a Tag Network
News
- Sports
  - Editorial
- Tech
  - Computers
    - Hardware
Ideology
- Liberal
  - Democrat
  - Marxist
- Geeky
  - Anti-Microsoft
- Nationalism
  - Brazil
  - USA
Athletics
- Outdoor
  - Baseball
  - Football
- Indoor
  - Soccer
Source
- Paper
- Magazine
- Online
  - Blog
  - Wikipedia
* Does not show (undirected) links between hierarchies (like News.Sports to Ideology.Nationalism) since it’s hard to show in text
Ex. Training Data
x1=3.45, x2=2.10, x3=5.45, x4=0.20, x5=9.20
y =
- News.Sports.Editorial
- Ideology.Nationalism.Brazil
- Athletics.Indoor.Soccer
- Source.Online.Wikipedia
x1=1.25, x2=6.93, x3=3.11, x4=8.01, x5=0.20
y=
- News.Tech.Computers.Hardware
- Ideology.Geeky.Anti-Microsoft
- Source.Online.Blog
How do we fill in the CPT values for each Bayes node in the tag network?
We just count the tag groupings in the training data and use these numbers to generate a distribution
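As a rough sketch of that counting step (my own helper names; I only count pairwise co-occurrences here, whereas the real CPTs would follow the network’s edge structure):

```python
from collections import Counter
from itertools import combinations

# Toy training labels (the y's from the examples above).
training_labels = [
    ["News.Sports.Editorial", "Ideology.Nationalism.Brazil",
     "Athletics.Indoor.Soccer", "Source.Online.Wikipedia"],
    ["News.Tech.Computers.Hardware", "Ideology.Geeky.Anti-Microsoft",
     "Source.Online.Blog"],
]

# Count single-tag and pairwise co-occurrences across the training set.
single = Counter()
pair = Counter()
for y in training_labels:
    single.update(y)
    pair.update(frozenset(p) for p in combinations(sorted(y), 2))

def cond_prob(tag_a, tag_b):
    """Estimate Pr(tag_a = 1 | tag_b = 1) from the counts."""
    if single[tag_b] == 0:
        return 0.0
    return pair[frozenset((tag_a, tag_b))] / single[tag_b]

print(cond_prob("Athletics.Indoor.Soccer", "News.Sports.Editorial"))  # 1.0 on this toy data
```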
Learning
- Create a Kernel (Support Vector) machine based binary classifier for each distinct tag
- Train each binary classifier with the features from the training data whose y contains the tag (set the classifier’s y = 1 for each of these feature sets) and with the features whose y does not contain the tag (set the classifier’s y = 0 for each)
- (side-note: also known as the one-versus-all approach, the most common multiclass method)
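Here’s a hedged sketch of that one-versus-all step using scikit-learn’s SVC as the kernel machine (any binary classifier would do); the toy X/Y just reuse the training examples above:

```python
import numpy as np
from sklearn.svm import SVC

# X: feature vectors, Y: the tag set for each training example.
X = np.array([[3.45, 2.10, 5.45, 0.20, 9.20],
              [1.25, 6.93, 3.11, 8.01, 0.20]])
Y = [{"News.Sports.Editorial", "Ideology.Nationalism.Brazil",
      "Athletics.Indoor.Soccer", "Source.Online.Wikipedia"},
     {"News.Tech.Computers.Hardware", "Ideology.Geeky.Anti-Microsoft",
      "Source.Online.Blog"}]

all_tags = sorted(set().union(*Y))

# One binary (kernel) classifier per distinct tag:
# y = 1 if the example carries the tag, else 0.
classifiers = {}
for tag in all_tags:
    y_bin = np.array([1 if tag in labels else 0 for labels in Y])
    clf = SVC(kernel="rbf")
    clf.fit(X, y_bin)
    classifiers[tag] = clf
```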
Predicting
We run a new feature set through the set of binary classifiers, which each output a 0 or 1
Now we could just use this bitstring to immediately return a set of tags (the tags associated with the 1 bits), and several existing Multi-Label approaches do this, but I think we can do better
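Continuing the sketch from the Learning step (it assumes the classifiers dict and np defined there), the raw prediction is just one bit per tag:

```python
# Run a new feature vector through every binary classifier to get one bit per
# tag: this is the raw bitstring b, before any correction against the tag network.
x_new = np.array([[2.0, 3.0, 4.0, 1.0, 5.0]])
b = {tag: int(clf.predict(x_new)[0]) for tag, clf in classifiers.items()}
print(b)  # e.g. {'Athletics.Indoor.Soccer': 1, 'Ideology.Geeky.Anti-Microsoft': 0, ...}
```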
The two main issues I have using this bitstring directly:
- The binary classifiers treat the tags independently of one another
- We don’t know which features correlate to which tags, and to what degree
- Therefore we may be using irrelevant features (since we use them all) for training the binary classifiers, which hurts accuracy
These issues introduce errors in our bitstring b
However, we can use the tag relationships in our Bayesian network to correct b
This problem lends nicely to an information theoretical approach
- We received b over a noisy channel
- Use what we know about tag relationships to reduce error (i.e. Hamming distance from the actual bitstring)
Reducing Error
There are several ways to go about error-correcting b; here’s the one I came up with:
(any feedback esp. here would be great)
- Use a Gibbs (MCMC) based sampling scheme to generate candidate bitstrings from b
- Nice convergence properties
- Not knowing which bits are wrong in b motivates randomized/sampling methods
- For n times, randomly choose a bit and flip it proportional to its probability in its CPT, output the new bitstring
- This scheme occasionally alternates from sampling new bitstrings based off previously ‘sample generated’ bitstrings and off the original b (could alternate after k iterations, where k is the median hamming distance score from the strings produced by the binary classifiers compared against the actual strings for a hidden training data set)
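Here’s roughly what I have in mind, as a sketch – the prob_of_bit function, which would read the flip probability off the tag network’s CPTs, is a hypothetical placeholder:

```python
import random

def sample_candidates(b, prob_of_bit, n=1000, k=3):
    """Rough sketch of the bit-flip sampling scheme described above.

    b            -- the original bitstring as a dict {tag: 0/1}
    prob_of_bit  -- hypothetical function giving Pr(tag = 1 | current assignment)
                    from the tag network's CPTs (its implementation is assumed)
    n            -- number of candidate bitstrings to emit
    k            -- after every k samples, restart from the original b
    """
    candidates = []
    current = dict(b)
    for i in range(n):
        if i % k == 0:            # occasionally alternate back to the original b
            current = dict(b)
        tag = random.choice(list(current))
        p = prob_of_bit(tag, current)
        current[tag] = 1 if random.random() < p else 0   # flip proportional to its CPT
        candidates.append(dict(current))
    return candidates
```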
Now we wish to find ‘interesting’ bitstrings from our set of n
- By ‘interesting’ I mean bitstrings containing frequent itemsets (i.e. sets of tags that frequently occur together)
- Use the Apriori data mining algorithm to find these bitstrings, call the returned set s
- Then, over the bitstrings in s, find the bits that have the same assignment in every bitstring and the bits assigned 1 in at least one (but not every) bitstring
Run a diagnosis (or MAP) query over the latter, conditioned on the former, which returns our desired tag assignments
Here’s an Ex.:
Say the bitstrings we get from the Apriori algorithm are:
A B C D E F G
0 1 0 1 0 1 0
0 1 1 1 0 1 0
0 1 0 1 0 0 1
I scan the bits and see A is 0, B is 1, D is 1, and E is 0 in all the bitstrings
I also see C, F, G were assigned 1 in at least one (but not every) bitstring
So I run this MAP query over my Bayes network to find the assignment that maximizes:
Pr(C=?, F=?, G=? | A=0, B=1, D=1, E=0)
and return the tags B and D (since they were assigned 1 everywhere) plus whichever of C, F, G get assigned 1 by this query
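A small sketch of that scanning step over the bitstrings Apriori returns, using the example above (the Apriori step itself and the MAP query are left as black boxes here):

```python
def split_evidence_and_query(bitstrings):
    """Given the bitstrings returned by Apriori (list of dicts {tag: 0/1}),
    separate tags fixed to the same value everywhere (the MAP evidence) from
    tags that are 1 somewhere but not everywhere (the MAP query variables)."""
    tags = bitstrings[0].keys()
    evidence, query = {}, []
    for tag in tags:
        values = {bs[tag] for bs in bitstrings}
        if len(values) == 1:
            evidence[tag] = values.pop()   # e.g. A=0, B=1, D=1, E=0 in the example
        elif 1 in values:
            query.append(tag)              # e.g. C, F, G in the example
    return evidence, query

bitstrings = [
    {"A": 0, "B": 1, "C": 0, "D": 1, "E": 0, "F": 1, "G": 0},
    {"A": 0, "B": 1, "C": 1, "D": 1, "E": 0, "F": 1, "G": 0},
    {"A": 0, "B": 1, "C": 0, "D": 1, "E": 0, "F": 0, "G": 1},
]
evidence, query = split_evidence_and_query(bitstrings)
print(evidence)  # {'A': 0, 'B': 1, 'D': 1, 'E': 0}
print(query)     # ['C', 'F', 'G']
```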
Bayesian Inference and Diagnosis
Our previous step has left us with a very difficult problem to solve
- Just doing plain ol’ Bayesian inference queries is #P-complete
- But MAP queries are even harder since they need to infer the probabilities of all possible assignments
But luckily for me, I wanted to reduce a fun problem down to a MAP query
- Lets me relate this project to some very interesting class material 🙂
Prof. Satish Kumar in CS294 lectured on an exciting method for quickly computing inference/MAP queries
Utilizes several interesting tricks:
- Precompiling the Bayesian network into an efficient SAT-based representation
- AND’ing each DNF clause with the query terms
- And then counting the # of solutions to each clause using Karp’s FPRAS algorithm
- Adding up the counters (each multiplied by some factor) solves your inference queries!
Can do DNF sampling (Las-Vegas or Atlantic-City style) under the same scheme for computing diagnosis queries
(references to learn more:
http://www.eecs.berkeley.edu/~tksk/classes/s06/handouts/lecture-06.pdf
http://www.eecs.berkeley.edu/~tksk/PAPERS/thesis.pdf, starting on pg. 179)
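For intuition, here’s my own rough sketch of the DNF solution-counting step (Karp’s FPRAS, a.k.a. the Karp-Luby estimator) – this is just the counting trick, not Satish’s precompilation scheme itself:

```python
import random

def karp_luby_count(clauses, n_vars, trials=10000):
    """Monte Carlo estimate of the number of satisfying assignments of a DNF
    formula (Karp's FPRAS). Each clause is a dict {var_index: required_bool}."""
    # Assignments satisfying each clause alone: 2^(n_vars - len(clause)).
    weights = [2 ** (n_vars - len(c)) for c in clauses]
    total = sum(weights)
    hits = 0
    for _ in range(trials):
        # Pick a clause proportional to how many assignments satisfy it,
        # then draw a uniform assignment that satisfies that clause.
        i = random.choices(range(len(clauses)), weights=weights)[0]
        assignment = {v: random.random() < 0.5 for v in range(n_vars)}
        assignment.update(clauses[i])
        # Count the pair only if i is the FIRST clause this assignment satisfies
        # (avoids double counting assignments covered by several clauses).
        first = next(j for j, c in enumerate(clauses)
                     if all(assignment[v] == val for v, val in c.items()))
        hits += (first == i)
    return total * hits / trials

# Example: (x0 AND x1) OR (NOT x2) over 3 variables -> 5 satisfying assignments.
clauses = [{0: True, 1: True}, {2: False}]
print(karp_luby_count(clauses, n_vars=3))  # roughly 5
```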
So, if we just precompile our tag network (before ever learning/predicting labels), then at runtime we can answer MAP queries in polynomial time w.r.t. the size of the SAT-based representation and exponential time w.r.t. the size of the largest communication link in our clique tree (an optimization that basically caches variables shared between family nodes)
Pretty cool!
(A nice side-note property: the more variable assignments we condition on, the more we reduce the size of our SAT-based representation)
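Putting the whole prediction side together, here’s a high-level sketch – every callable passed in (gibbs_sample, apriori_frequent, split_evidence_and_query, map_query) stands in for one of the earlier sketches or the precompiled tag network, none of which is defined here:

```python
def predict_tags(x_new, classifiers, gibbs_sample, apriori_frequent,
                 split_evidence_and_query, map_query, tag_network):
    """End-to-end sketch of the prediction pipeline described in this post.
    All helper callables are assumptions injected by the caller."""
    # 1. Raw bitstring from the per-tag binary classifiers.
    b = {tag: int(clf.predict([x_new])[0]) for tag, clf in classifiers.items()}
    # 2. Error-correct b by sampling candidate bitstrings against the tag network.
    candidates = gibbs_sample(b, tag_network)
    # 3. Keep only the 'interesting' (frequent-itemset) bitstrings.
    frequent = apriori_frequent(candidates)
    # 4. Split into fixed evidence bits and ambiguous query bits, then run MAP.
    evidence, query = split_evidence_and_query(frequent)
    best = map_query(tag_network, query, evidence)
    # 5. Return every tag that ends up assigned 1.
    return [t for t, v in {**evidence, **best}.items() if v == 1]
```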
Benefits
Everything is Parallelizable!
- Counting the tag occurrences in the training data to populate the CPTs in the Bayes network (split by record)
- Generating the set of binary classifiers and running a new feature set over the binary classifiers (since they are independent)
- Induced correction sampling of bitstrings
- Apriori is parallelizable
- Scanning the bitstrings returned from Apriori for commonalities (split by bitstring, for each tag just output 1/0, then reduce the sum and check it with the total # of bitstrings)
- Even Satish’s inference/diagnosis algorithms (split by DNF clause)
Decouples tag relations and learning
- Our tag network does not condition on specific feature values
- This could be seen as a bad thing but learning the causalities of specific feature values to a label (or group of labels) sounds very difficult/messy & at best application limiting since it assumes we:
- Know all (or even a subset of) the possible values each of the features can take
- Have a learning method that maps each of these individual values to a tags distribution
- However, binary classifiers implicitly find discrepancies in particular features in order to differentiate data inputs
- In our model, the user just needs to have tag-specific binary classifiers
- They don’t need to incorporate causality with other tags in their one-tag specific learner
- This avoids multiclass learners (which aren’t nearly as well understood nor as accurate as binary classifiers)
- Additionally, our paradigm lets users plug in their favorite binary classifier (many to choose from, as mentioned above)
- Our model lets users simply construct tag dependency trees without having to explicitly show why those links exist
By making the problem more complex we might actually be improving learning accuracy
- Tag networks, hierarchical labeling, & multi-labeling – added complexity to the system – should improve learning
- Lets us now exploit the overlap of tags
- Multi-Labeling increases the chances that our learner can predict one, a subset, or all the tags correctly for a new input
- May be a better payoff than the currently popular paradigm where you either get the label right or you don’t