The Future of Tagging

Update (9/25): A Short Presentation On This (forgot to link)
Update (5/06): An Algorithmic Update
Update (3/29): Added a Motivation section


After reviewing the comments/emails I’ve received so far, I realized my article could make the motivation clearer. As I’ve mentioned in this post and its comments, I agree with everyone regarding basic tagging – it’s by far the simplest way for any user to organize their media/text so that others can search and relate their content. I don’t see this ever going away. However, in this article I decided to take a step back, look at the issues with the current tagging model, and examine an alternative method, namely hierarchical multi-labeling. Hierarchical multi-labeling solves many of the issues with basic tagging and should lend itself to better auto-tagging algorithms, since it tells us how tags are related to each other. I definitely agree this isn’t something we should expect the average user to perform – but I do think power users and content aggregators like Google News could benefit greatly from this tagging model.

One of my goals this semester is to let anyone (most likely a hardcore geek) pass a tag network and training data (both represented as files) to my web service, and I’ll generate for them a classifier page (with a search box that takes in a webpage link or a string of words). Click ‘Classify’ and it returns the best set of tags for that input, based on the training examples. Services can use this to classify their news articles, emails, bookmarks, etc. Notice the grunt work (which isn’t too bad) is done by a geek, but all users can benefit from such a system.

Anyway, hope you enjoy the read and please comment if possible.


Two ways we can improve upon tagging:

  1. Support label hierarchies/groupings
  2. Use multiple (but ‘normalized’, explained below) labels per object

Why Hierarchical Tags?

Many apps make use of a single-level hierarchy of tags

  • Examples: Flickr, Gmail, Google Base, Technorati, WordPress, Delicious, YouTube
  • Big buzz around tagging
  • Finally provides users a simple way to structure their unstructured content (typically text, photos, video, and music media)
  • Makes it easier to search for related items

But a one-level hierarchy has issues

  • Namespace – redundant or similarly named tags are treated differently – wasteful
  • Not normalized – tags are not equal; general and specific tags sit at the same level
  • Loses relationships among different tags (how are tag x and tag y causally related?)
  • One-level hierarchy labels are really just keywords
  • Not much different from a concise textual description
  • Only provides structural support if multiple documents use the same tags
  • In its current form, it’s an unstructured way of structuring unstructured content
  • But better than nothing

A very simple example: Gmail has a one-level hierarchy of tags

  • I get an email regarding my CS294 class, so I tag it with ‘CS294’
  • However, I would also like to be able to search through all emails
    relevant to school, so I have to tag it with the label ‘Coursework’
  • But ‘Coursework’ encompasses ‘CS294’
  • I have to redundantly add ‘Coursework’ to each CS294-tagged email even though ‘CS294’ implies ‘Coursework’
  • I could label it ‘Coursework\CS294’, but that’s just one tag specific to
    CS294; I can’t separate out the ‘Coursework’ part for search
  • This slashing technique, popularized on many Delicious sites, provides hierarchy in name only
    • It fails to capture any practical benefits like searching or relating different hierarchies
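To make the Gmail example concrete, here’s a minimal sketch (hypothetical data and function names, not any real Gmail API) of how path-style hierarchical tags let a search on a parent tag match all of its children, so the redundant ‘Coursework’ tag disappears:

```python
# Store each tag as a path in a tree, so searching a parent tag like
# 'Coursework' also matches child tags like 'Coursework/CS294'.
emails = {
    "midterm logistics": ["Coursework/CS294"],
    "lease renewal": ["Personal/Housing"],
}

def search(label):
    """Return emails whose tag path starts at `label` (parent implies children)."""
    return [subject for subject, tags in emails.items()
            if any(t == label or t.startswith(label + "/") for t in tags)]

print(search("Coursework"))        # matches the CS294 email with no extra tag
print(search("Coursework/CS294"))  # the specific tag still works too
```

Note that the slashing trick from the bullet above fails precisely because it treats ‘Coursework\CS294’ as one opaque string; here the prefix check is what recovers the hierarchy.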

Why Multi-Label?

Objects typically (and should) fall under several categories

Many cool AI applications/data representations motivate multiple labels:

  • Medical Diagnosis
    • A real-life Dr. House without the attitude
    • Normally many causes/diseases are associated with a set of health features
    • Helps diagnosticians narrow down the most likely set of causes
  • Computer Vision (e.g. Flickr, Riya)
    • For example, ESPgame/Peekaboom collect many labels for images/pixels
    • Could use their data to train computer-vision learners for auto-tagging
  • Email/Filesystems/Databases/Search (e.g. Gmail, WinFS, SQL, Google Base)
    • The file/directory concept is outdated
    • ‘Files’ are really objects which have metadata and relational aspects
    • Multi-labels present a great, simple way to structure the diverse unstructured content in a file
    • (side-note: hierarchical tags could be used to provide backwards compatibility with file/directory)
  • News (e.g. Google News, Digg, CNN, NY Times, Slashdot)
    • Multiple (hierarchical) labels for each news piece
    • Seeing labels like { News.Tech.Computers.Hardware; Ideology.Geeky.Anti-Microsoft; Source.Online.Blog; People.Bill-Gates }
      … tells me a ton about an article before I even read it
    • Plus I can now search/relate these tags to find similar news articles based on specific attributes

Let’s get a bit more technical

We organize labels into trees (gives us hierarchies)

Per object, we choose multiple labels, where each label comes from a different tree (hence ‘normalized’ – this provides some degree of separation/independence among the tags)
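A minimal sketch of this ‘normalized’ rule, using labels drawn from the tag trees shown later in the post (the `TREES` table and function names are illustrative only): a label set is accepted only when every label comes from a different tree:

```python
# Each dotted label's first component names the tree it belongs to.
def tree_of(label):
    return label.split(".")[0]

def is_normalized(labels):
    """True iff no two labels come from the same tag tree."""
    roots = [tree_of(l) for l in labels]
    return len(roots) == len(set(roots))  # at most one label per tree

assert is_normalized(["News.Sports.Editorial", "Source.Online.Blog"])
# two labels from the News tree -> not normalized
assert not is_normalized(["News.Sports.Editorial", "News.Tech.Computers.Hardware"])
```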

So, what’s the point of adding all this complexity to tagging?

One of the nice benefits of tagging is that it’s so simple

I agree: I’m not expecting mommy and daddy to do hierarchical multi-labeling

But content providers can do this to reap the benefits described above

AND, because it will help our artificial intelligence algorithms learn how to multi-tag objects automatically (mainly because we know how tags are related to each other)

A possible machine learning algorithm for hierarchical multi-labeling


We’ll build this algorithm on top of binary supervised classifiers because:

  • They’re well understood in theory & practice; simple, with the best accuracy
  • Many multiclass classifiers actually use several pairwise (all-pairs, one-versus-all, etc.) binary classifiers
  • There are many algorithms to work with: Perceptron, kernel machines (Support Vector Machines), neural nets, decision trees, etc.

We want to create a Bayesian network based on the tag trees (actually it’s more like a Markov random field, since there are undirected edges between tree nodes which occur together in the training data, annotated with CPT/SAT-based representations describing the causalities)

Ex. of a Tag Network

  • News
    • Sports
      • Editorial
    • Tech
      • Computers
        • Hardware
  • Ideology
    • Liberal
      • Democrat
      • Marxist
    • Geeky
      • Anti-Microsoft
    • Nationalism
      • Brazil
      • USA
  • Athletics
    • Outdoor
      • Baseball
      • Football
    • Indoor
      • Soccer
  • Source
    • Paper
      • Magazine
    • Online
      • Blog
        • Wikipedia

* Does not show the (undirected) links between hierarchies (like
News.Sports to Ideology.Nationalism) since they’re hard to show in text

Ex. Training Data

x1=3.45, x2=2.10, x3=5.45, x4=0.20, x5=9.20

y =

  1. News.Sports.Editorial
  2. Ideology.Nationalism.Brazil
  3. Athletics.Indoor.Soccer
  4. Source.Online.Wikipedia

x1=1.25, x2=6.93, x3=3.11, x4=8.01, x5=0.20

y =

  1. News.Tech.Computers.Hardware
  2. Ideology.Geeky.Anti-Microsoft
  3. Source.Online.Blog

How do we fill in the CPT values for each node in the tag network?
We simply count the tag groupings in the training data and use these counts to generate a distribution
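Here’s a hedged sketch of that counting step over the two training examples above. The `cond_prob` helper is a stand-in for a single CPT entry, and the Laplace-smoothing constants (`alpha`, `vocab`) are my own assumption, not from the post:

```python
from collections import Counter
from itertools import combinations

# label sets from the two training examples (abbreviated)
training_labels = [
    {"News.Sports.Editorial", "Ideology.Nationalism.Brazil", "Source.Online.Wikipedia"},
    {"News.Tech.Computers.Hardware", "Ideology.Geeky.Anti-Microsoft", "Source.Online.Blog"},
]

single = Counter()  # how often each tag appears
pair = Counter()    # how often each pair of tags appears together
for labels in training_labels:
    single.update(labels)
    pair.update(frozenset(p) for p in combinations(sorted(labels), 2))

def cond_prob(tag, given, alpha=1.0, vocab=10):
    """P(tag | given) estimated from smoothed co-occurrence counts."""
    return (pair[frozenset((tag, given))] + alpha) / (single[given] + alpha * vocab)
```

With more tags per node a real CPT would condition on the node’s full parent set, but the estimate is still just a ratio of grouping counts like this one.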


  • Create a kernel (Support Vector) machine based binary classifier for each distinct tag
  • Train each binary classifier with the features from the training data whose y contains the tag (set the classifier’s y = 1 for each of these feature sets)
    and with the features whose y does not contain the tag (set the classifier’s y = 0 for each)

    • (side-note: this is also known as the one-versus-all approach, the most common multiclass method)
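Here is one way the one-versus-all step could look. To stay self-contained this sketch uses a plain perceptron (one of the binary learners listed in the motivation) rather than an SVM, and the tiny two-example training set with abbreviated tag sets is illustrative only:

```python
def train_perceptron(X, y, epochs=20):
    """Classic perceptron on (-1/+1) labels; returns weights and bias."""
    w = [0.0] * len(X[0]); b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) <= 0:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

X = [[3.45, 2.10, 5.45, 0.20, 9.20],
     [1.25, 6.93, 3.11, 8.01, 0.20]]
labels = [{"News.Sports.Editorial", "Source.Online.Wikipedia"},
          {"News.Tech.Computers.Hardware", "Source.Online.Blog"}]
all_tags = sorted(set().union(*labels))

# one-versus-all: one binary classifier per distinct tag
models = {t: train_perceptron(X, [1 if t in ls else -1 for ls in labels])
          for t in all_tags}

# the bitstring for a feature vector: one bit per tag
bitstring = [predict(models[t], X[0]) for t in all_tags]
```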


We run a new feature set through the set of binary classifiers, each of which outputs a 0 or 1

Now we could just use this bitstring to immediately return a set of tags (the tags associated with the 1 bits), and several existing multi-label approaches do this, but I think we can do better

The two main issues with using this bitstring directly:

  1. The binary classifiers treat the tags independently of one another
  2. We don’t know which features correlate with which tags, and to what degree
    • Therefore we may be using irrelevant features (since we use them all) for training the binary classifiers, which hurts accuracy

These issues introduce errors into our bitstring b

However, we can use the tag relationships in our Bayesian network to correct b

This problem lends itself nicely to an information-theoretic approach

  • We received b over a noisy channel
  • Use what we know about tag relationships to reduce the error (i.e. the Hamming distance from the actual bitstring)

Reducing Error

There are several ways to go about error-correcting b; here’s the one I came up with
(any feedback, especially here, would be great):

  • Use a Gibbs (MCMC) based sampling scheme to generate candidate bitstrings from b
  • Nice convergence properties
  • Not knowing which bits are wrong in b motivates randomized/sampling methods
  • For n iterations, randomly choose a bit, flip it with probability proportional to its entry in its CPT, and output the new bitstring
  • This scheme occasionally alternates between sampling new bitstrings based on previously generated bitstrings and based on the original b (we could alternate every k iterations, where k is the median Hamming-distance score of the strings produced by the binary classifiers compared against the actual strings for a held-out training data set)
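A rough sketch of this sampling loop. The `tag_prob` function below is a toy placeholder for the real CPT lookup in the tag network, and the restart-every-k behavior is a simplification of the alternation described above:

```python
import random

def tag_prob(bits, i):
    """Toy stand-in for a CPT lookup: P(bit i = 1 | the other bits)."""
    return 0.8 if sum(bits) - bits[i] >= 2 else 0.2

def sample_candidates(b, n, k, seed=0):
    """Generate n candidate bitstrings by randomly resampling one bit at a time."""
    rng = random.Random(seed)
    out, current = [], list(b)
    for step in range(n):
        if step % k == 0:            # every k steps, restart from the original b
            current = list(b)
        i = rng.randrange(len(current))
        current[i] = 1 if rng.random() < tag_prob(current, i) else 0
        out.append(tuple(current))
    return out

candidates = sample_candidates([0, 1, 0, 1, 0, 1, 0], n=100, k=10)
```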

Now we wish to find ‘interesting’ bitstrings in our set of n

  • By ‘interesting’ I mean bitstrings containing frequent (related) itemsets
  • Use the Apriori data mining algorithm to find these bitstrings; call the returned set s
  • Then, over the bitstrings in s, scan for bits that have the same assignment in every bitstring, and for bits assigned 1 in at least one bitstring

Run a diagnosis (or MAP) query over the tags assigned 1 somewhere, conditioned on the tags assigned the same value in every bitstring; this returns our desired tag assignments

Here’s an Ex.:

Say the bitstrings we get from the Apriori algorithm are:

0 1 0 1 0 1 0
0 1 1 1 0 1 0
0 1 0 1 0 0 1

I scan the bits and see that A is 0, B is 1, D is 1, and E is 0 in all the bitstrings

I also see C, F, G were assigned 1 in at least one (but not every) bitstring

So I run this MAP query over my Bayes network to find the assignment that maximizes:

Pr(C=?, F=?, G=? | A=0, B=1, D=1, E=0)

and return the tags B and D (since they were assigned 1 everywhere) plus whichever ones are assigned 1 by this query
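The scan described above can be sketched directly over the three example bitstrings: bits fixed across every string become the MAP evidence, and the remaining bits that are 1 somewhere become the query variables (the variable names A–G follow the example):

```python
strings = [
    (0, 1, 0, 1, 0, 1, 0),
    (0, 1, 1, 1, 0, 1, 0),
    (0, 1, 0, 1, 0, 0, 1),
]
names = "ABCDEFG"

evidence, query = {}, []
for i, name in enumerate(names):
    column = {s[i] for s in strings}
    if len(column) == 1:
        evidence[name] = column.pop()   # same value in every bitstring -> condition on it
    elif 1 in column:
        query.append(name)              # 1 in at least one bitstring -> MAP query variable

# the MAP query is then Pr(query vars | evidence) over the tag network,
# and every tag fixed at 1 in `evidence` is returned outright
```

Note that D comes out as evidence with value 1, so like B it is returned directly rather than queried.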

Bayesian Inference and Diagnosis

Our previous step has left us with a very difficult problem to solve

  • Just doing plain ol’ Bayesian inference queries is #P-complete
  • But MAP queries are even harder, since they need to infer the probabilities of all possible assignments

But luckily for me, I wanted to reduce a fun problem down to a MAP query

  • It lets me relate this project to some very interesting class material 🙂

Prof. Satish Kumar in CS294 lectured on an exciting method for quickly computing inference/MAP queries

Utilizes several interesting tricks:

  • Precompiling the Bayesian network into an efficient SAT-based representation
  • AND’ing each DNF clause with the query terms
  • Then counting the number of solutions to each clause using Karp’s FPRAS algorithm
  • Adding up the counters (each multiplied by some factor) solves your inference queries!

Can do DNF sampling (Las-Vegas or Atlantic-City style) under the same scheme for computing diagnosis queries
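For reference, here’s a hedged sketch of the coverage idea behind Karp’s FPRAS for DNF counting: sample a (clause, assignment) pair with probability proportional to the clause’s cover, and count the sample only when that clause is the first one the assignment satisfies. The clause encoding (a dict of fixed literals) is my own assumption for the sketch:

```python
import random

def karp_luby_count(clauses, n_vars, trials=20000, seed=0):
    """Estimate the number of satisfying assignments of a DNF formula.
    clauses: list of dicts mapping variable index -> required bool."""
    rng = random.Random(seed)
    # a clause fixing k literals covers 2^(n_vars - k) assignments
    weights = [2 ** (n_vars - len(c)) for c in clauses]
    total = sum(weights)
    hits = 0
    for _ in range(trials):
        # pick a clause proportional to its cover, then a random
        # assignment consistent with that clause
        j = rng.choices(range(len(clauses)), weights=weights)[0]
        a = {v: rng.random() < 0.5 for v in range(n_vars)}
        a.update(clauses[j])
        # count only if j is the FIRST clause a satisfies (canonical owner)
        first = next(i for i, c in enumerate(clauses)
                     if all(a[v] == val for v, val in c.items()))
        hits += (first == j)
    return total * hits / trials

# x0 OR x1 over two variables has 3 satisfying assignments
estimate = karp_luby_count([{0: True}, {1: True}], n_vars=2)
```

The estimator is unbiased because each satisfying assignment is counted by exactly one (its first) clause, so the hit rate equals |union| / total cover.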

(references to learn more: starting on pg. 179)

So, if we just precompile our tag network (before ever learning/predicting labels), then at runtime we can answer MAP queries in time polynomial in the size of the SAT-based representation and exponential in the size of the largest communication link in our clique tree (an optimization that basically caches variables shared between family nodes)

Pretty cool!

(A nice side-note property: the more variable assignments we condition on, the more we reduce the size of our SAT-based representation)


Everything is Parallelizable!

  • Counting the tag occurrences in the training data to populate the CPTs in the Bayes network (split by record)
  • Generating the set of binary classifiers and running a new feature set over the binary classifiers (since they are independent)
  • Induced correction sampling of bitstrings
  • Apriori is parallelizable
  • Scanning the bitstrings returned from Apriori for commonalities (split by bitstring, for each tag just output 1/0, then reduce the sum and check it with the total # of bitstrings)
  • Even Satish’s inference/diagnosis algorithms (split by DNF clause)

Decouples tag relations and learning

  • Our tag network does not condition on specific feature values
  • This could be seen as a bad thing, but learning the causalities of specific feature values to a label (or group of labels) sounds very difficult/messy, and at best application-limiting, since it assumes we:
    1. Know all (or even a subset of) the possible values each of the features can take
    2. Have a learning method that maps each of these individual values to a distribution over tags
  • However, binary classifiers implicitly find discrepancies in particular features in order to differentiate data inputs
  • In our model, the user just needs to have tag-specific binary classifiers
  • They don’t need to incorporate causality with other tags in their one-tag-specific learner
    • Avoids multiclass learners (which aren’t nearly as well understood nor as accurate as binary classifiers)
    • Additionally, our paradigm lets users plug in their favorite binary classifier (many to choose from, as mentioned in Motivations)
  • Our model lets users simply construct tag dependency trees without having to explicitly show why those links exist

By making the problem more complex we might actually be improving learning accuracy

  • Tag networks, hierarchical labeling, & multi-labeling add complex functions to the system, which should improve learning
  • Lets us exploit the overlap of tags
  • Multi-labeling increases the chances that our learner can predict one, a subset, or all of the tags correctly for a new input
    • This may be a better payoff than the currently popular paradigm where you either get the label right or you don’t


This work is licensed under a Creative Commons License



34 thoughts on “The Future of Tagging”

  1. I’ve seen other proposals for “standardized”, “structured” or “hierarchical” tagging methods before… Whether for other web sites, music, general metadata, whatever. Honestly, you’re just reinventing the wheel, and in a really inferior way too. Librarians have been doing this stuff for decades. It’s called LC Subject Classification, and it covers pretty much everything that anyone has ever written about.

    And it’s being destroyed by keyword and natural language searching. Nobody wants to have to memorize a giant and difficult hierarchy of subjects when they can just type in whatever they use to describe what they want to find. You won’t get results from people who use different words and descriptions than you do, and you might not get the best possible results, but you’ll get results anyways.

    The lesson is people are lazy.

  2. Thanks for the comment Klyith.

    I agree – I’m not saying my design is novel, nor do I think normal users will do hierarchical multi-labeling on their content (as I mentioned in my post). But, I do think content providers or power users can use this scheme to better organize their data.

    Thanks for the link to the LC subject classification hierarchy. My scheme is pretty much the same except I enable content providers to devise their own hierarchy specific to their data. For example, consider if Google News provided me these tags for a science/tech article:

    News.Computer.Hardware.Ultra-Portable.Origami, Ideology.Geeky.Anti-Microsoft, People.BillG, Source.Online.Blog

    This gives me tremendous value – gives me context AND the ability to search for articles in any of the labels (and any of the hierarchies within a label).

    Now I’m pretty sure LC hierarchy doesn’t have tags this specific – that’s why I think content providers should be able to customize their own labels. Additionally, data/info/news change all the time, so labels should evolve with the data (which content providers can do here since they control the structure of the tags).

    But none of this is new, it’s simply a tree of tags. Tagging many labels to a document is also not new. I’m just describing some of their benefits compared to the tags I’ve seen that look like keywords.

    However, one point I don’t quite understand is why this is a ‘really inferior way’. It sounds like the same thing as the LC classification system. The crux of this post is to show how this mere structure could lend to a clean machine learning algorithm for doing all of this tagging stuff for us automatically.

  3. The whole point of tagging is the insight that the world is not hierarchical.

    If the world were structured hierarchically, ontologies such as LC would work and machine learning would be a lot simpler.

    But it isn’t.

    So we use tags. And we use machine learning to find some (not necessarily hierarchical) structure among the tags. If user A uses tag X and user B uses tag Y to describe the same concept it’s not too difficult to find that, statistically, X and Y are equivalent. And if X is a subconcept of Y we can deduce this, given enough data.

    So machine learning plays a role (or should play in the near future). But not in the sense that we structure our descriptions such that they are easy to process for a machine learning algorithm. We structure them so that they make sense to us.

  4. Well thought article, but I agree with Matt. The whole reason I like tagging things is because I don’t have to think hierarchically.

  5. Hi Matt – Thanks for the comment. I agree with you, simple tags with no concern of hierarchy is definitely easier, and there are ways to still do machine learning given enough data. However, I’m not sure if I agree that ‘tagging is the insight that the world isn’t hierarchical’, considering the number of Delicious sites I’ve seen which attempt to organize their tags in groupings using the slashing technique, or the fact that many people organize their emails and files in hierarchies (i.e. file/directory).

    As I’ve mentioned in my post, I don’t think many people will use the technique I’ve outlined above – especially when the alternative of just laundry listing common keywords is so much easier. In these cases we’ll need to come up with better algorithms to machine learn the tags.

    However, I looked at this problem with a “What if I wanted to machine learn the tags corresponding to my data right now, what would be the best way to organize my tags to maximize prediction accuracy?” mindset. This is in a more ideal but controlled world where I’m willing to put in more effort than I really should to give my learner an easier time. To do this, I organize the tags in a dependency graph (which incidentally gives me hierarchies for free).

    Tag trees tell us a ton about how labels are related – the learner I describe uses them to understand the overlap among tags for error correction. Currently in the machine learning literature we have pretty well-understood algorithms for doing binary classification (YES/NO) and OK stuff for doing multiclass (select one of the following: Conservative, Liberal, Moderate). However, there isn’t much literature or standard technique when it comes to multi-labeling (probably because people have their hands full with the previous problems).

    Say this is one of our training examples:

    foo.html -> {News.Political, Ideology.Liberal.Marxist, etc.}

    It’s difficult for our learners to do this because they need to find features in foo.html that correspond to each tag and implicitly find relations among tag groupings (since they aren’t disjoint anymore, like in multiclass). Basically what I’m trying to say is multi-labeling is a very difficult problem, and at the point where multiclass is already hard enough, expecting our learners to statistically learn how to do unstructured multi-tagging might be asking for a lot. My quest in this post is to examine how structuring tags could help learners predict better, and I try to make a case for that and for the possibility that it might even perform better than multiclass learners.

  6. Tagging is probably the most accessible way to organize stuff today. If the system was more advanced, I probably wouldn’t bother using it at all, and I suspect other users feel like that too.

    It could probably be useful with metatags, so that you could easily specify that “if post x is tagged with y, it is related to z” so that tagging x with y automatically tags it with z. If it gets more complicated than that, I suspect that the companies that have gained acceptance by mainstream users, such as and flickr would lose their appeal to non-technical users if they got more advanced.

  7. Hi Simen – I definitely agree. As I mentioned in the post and earlier comments I don’t expect/wish for the average user to do hierarchical multi-labeling – we’re lucky enough that they even tag anything! This proof of concept technique would be more tailored for content providers and power users.

  8. I am not sure if you mean prescribing a particular hierarchical semantic structure for a particular domain.

    But that wouldn’t be needed: you could extend the tagging in a ‘free’ way by allowing the tags themselves to be tagged. This would invert the tree, so the ‘CS294’ tag would be tagged ‘Coursework’, for example.

    Maybe that is what you mean…

  9. Flat tagging classification simply runs out of gas when you try to manage a collection of tags larger than a few dozen. If you’re used to creating folder trees for disk collections, tagtrees are an obvious remedy, and they can reduce directly to multiple membership in a flat tagging scheme if you want them to — Movies.Comedy and Movies.Drama can reduce to (Movies, Comedy) and (Movies, Drama).

    You can either define such tag hierarchies explicitly, as lets you do, or I think a more interesting approach would be to have the machine create visualizations of relevant tagtrees automatically based on multiple tag memberships.

    An argument for human-generated hierarchies is that a tree can quickly get too wide. An automatically generated view of “Books”, for instance, would reveal child tags not only for genres but for every single author in the database, which would be unwieldy.

    The comment about tagging tags is interesting. The same effect could be achieved by assigning tags from multiple trees to the same end nodes, but this puts a lot of burden on the taggers. Maintaining a general ruleset à la Outlook folder-sorting rules sounds tedious too, though. The recommended-tags feature reduces some of the burden of associating multiple tags, so maybe this is the path to pursue.

    Re tools becoming too complex, a tool having advanced features shouldn’t be discouraging to users, so long as they can still walk up to it and find it immediately useful without investing in the time to become an expert. Google for instance has a great set of optional advanced features that casual users don’t need to worry about.

  10. Isn’t “triple-tagging” an attempt to trick a tagging system into performing the functions of a relational database? Tagtrees also provide a syntax for expressing such relationships, don’t they?

    I.e. “ geo.long.-2.5678” rather than


    If you generally have complex multi-valued relationships like book/author/title/year/publisher, and lots of disparate collections (animals, vegetables, minerals) each with their own ontology, which system is preferable?

  11. I think you’ve overdone it — tagging works *because* it is less structured and casual, and that’s ideal for us human beings. Any attempt to normalize or apply strict rules will make it useless for us.

    That said, I do agree that for some applications, hierarchical tags would be useful as in your email example — but these are where a *single* user defines and uses the struture.

    Read The Grand Unified Theory of Everything: Search vs. Filing vs. Tagging

  12. FotoFlix adds a layer on top of tags that lets you create a hierarchy of tag combinations. They’re called QuickSets, but what they really are is dynamic tag groupings in a hierarchical layout. An example would be

    Family (tags:family)
    +Kim (tags:family,kim)
    +Barbi (tags:family,barbi)
    Vacations (tags:vacations)
    +Hawaii (tags:vacations,hawaii)

  13. @Simen
    “…Tagging is probably the most accessible way to organize stuff today. If the system was more advanced, I probably wouldn’t bother using it at all, and I suspect other users feel like that too. …”

    Yes, I feel like that too.

  14. If flat tagging were enough for casual users, then cataloging programs like iTunes wouldn’t be so popular. We’re all suffering from information overload. What users are really saying is “if your solution is too complicated, I won’t use it.”

    As the tag space grows, you need metastructure of some kind; the question is what kind?

    Simen, I agree that a relational database is a better model for memory than a hierarchy, but how does one proceed in imbuing a tag query with some of the advantages of a database query?

    If one is navigating a tagspace of MP3 files, you can descend via different routes — artist first, genre first, etc. A relational approach gives the user the experience of parallel hierarchies.

  15. I agree that the relational database model only gets you so far; adding queries over tags would introduce unnecessary bloat. I think tagging could probably be combined with a natural language search algorithm of some kind. There was a Techcrunch feature not long ago – I can’t seem to remember the url – of a startup which lets you bookmark/tag/share Google searches. That’s possibly a way to go.

  16. While all information might be kept in a relational database – different views – i.e. both tagged and hierarchy views can be generated from the same database.

    There is no conflict between data held in tree-hierarchies and tagged views – both are functionally equivalent.

    The same fundamental data underlies both views – it’s how we use it that counts.

  17. Tagging has had success because it is lightweight enough metadata to be easy, and its non-hierarchical nature lends itself to social applications (like the quick aggregation of photo themes on Flickr). I think a better approach will be to build tags into the interaction model of applications, turn the actions and intentions of users into inferred or implied tags, then surface that information as a basis for explicit meta-tagging action later (instead of putting so much of the burden on the user).

    Right now, for example, a searcher types in one or two words as a query to a search engine, then they might pick one or two Web pages to visit. Essentially, they have tagged those pages with the query term, but that metadata is currently lost to the searcher. Instead we ask the searcher to take explicit action to save a Web page, then tag it with terms that may have nothing to do with their original query; the relationship between the two metadata linked to the same entity could be very valuable to both the user and the system. I think this kind of explicit/implicit tagging could add a lot of relevance and richness to a lot of sites and applications.

  18. In my opinion tagging–as currently implemented–is useless and overhyped. I use blinklist as my social bookmarking site (rather than and it leans heavily on tags. The worst thing about so many implementations of tags is that they’re not practical. On Blinklist, in order to find a site I tagged, I have to remember the exact tag I used. Isn’t the whole idea of tags to help us out and make things easier? So far it seems like every site has tagging on it in some capacity, but no site/app makes intelligent use of them. Grrrr. Tagging pisses me off, though I hope eventually someone will come up with some intelligent tag practices that will be adopted webwide.

  19. What the heck is NHTML?

    The idea of hierarchical tagging seems interesting. How would it be if one used “NHTML” tags? Although no tagging service I know of supports this kind of thing.

    It could work in the case of your e-mail classification: each tag, say “CS294”, would be NHTML’ed with “Coursework”; every time you want to tag something with these two tags, you can use the NHTML’ed tag. It is a list of tags, but each is linked from a single NHTML’ed tag.

    To make matters easy, a mouse-over on each character could be used to show what each tag actually is. Each of these NHTML tags is a kind of user-generated tree.

    The non-technical user can be spared the burden by automating the process of NHTML’ing…
