Category Archives: Research

SDSS Skyserver Traffic

This past summer I worked at MSR alongside Dr. Jim Gray on analyzing the Skyserver’s (the online worldwide telescope portal) web and SQL logs. We just published our findings, which you can access here (MSR) or here (updated).

Still needs some clean-up (spelling, grammar, flow) and additional sections to tie up some loose ends, but it’s definitely presentable. Would love to hear what you guys think about the results (besides how pretty the graphs look :).



Filed under CS, Databases, Education, Publications, Research, Science

SQL Text Mining

One of the projects Jim Gray and I worked on this summer was classifying the types of SQL users ask on the SkyServer site ( ). We were surprised that we could not find any existing research that could describe methods on how to break down the SQL for categorization – especially considering the number of websites and database workloads that bookkeep query logs. Below is a link to the powerpoint presentation I gave at MSR Mountain View last week which describes how we analyzed the SQL. Notable features include text processing strategies, clustering algorithms, distance functions, and two example applications (Bot detection and Query recommendation). We plan to publish our algorithms and results in a technical report in the next month or so – but for now, enjoy the .ppt. As always, comments are more than welcome.

SQL Text Mining Presentation

Creative Commons License

1 Comment

Filed under AI, CS, Databases, Machine Learning, Research

Hierarchical Multi-Labeling Presentation Slides

Below is a small class talk I gave on the hierarchical multi-labeling classification framework I outlined in my previous ‘Future of Tagging’ posts. I did a small experiment classifying tech news articles as Pro/Anti- Microsoft/Google (along with some other tags like the tech category and whether the article is a blog or publication based off the text of the piece). The results are very promising – even with such a small corpus of training documents the classifier performed very well. I do have some ideas on how to further improve accuracy, so when time cycles free up I’ll add those to the code and rerun it on a bigger and better (in terms of tag structure) training data set. By then I’m hoping the code will look less embarassing for public release and the results will be more conclusive – but until that day here are the presentation slides:



Filed under AI, CS, Machine Learning, Research, Tagging

The Future of Tagging – Part II

Note: Refers to the ideas described in the original post

An Algorithmic Update

Just wanted to let people know that I’ve changed my algorithms/framework for hierarchical mult-labeling classification quite a bit. One thing that really bugged me about my initial idea was the error correction scheme – i.e. sampling the tag network (a bayesian/mrf hybrid) for closely related bitstrings. All the SAT/conditional probability table values in this network are generated from the number of times tags occur together in the training data, thus making my error correction scheme a popularity contest. But what about the feature values? We SHOULD take these values into account and try to reduce our new input down to a training data example with closely related feature values THAT also happens to have a similar tag bitstring (based off the prediction string outputted by the binary classifiers).

With regards to assuming there are k errors in the bitstring (call it b) we get back from the classifiers – before we sampled new candidate bitstrings based off the bitpattern produced after randomly unsetting k bits in b. Instead, since many classifiers (like the support vector I’m using) can return a probability confidence associated to the 0/1 output, my new algorithm chooses the k positions to unset not uniformly at random, but rather with a bias towards the bits with the smallest probabilities (since they are most likely the erroneous ones according to the classifiers).

Another thing I added were two tag normalization rules for determining how to choose labels:

  1. No more than one tag from each tree/hierarchy
  2. Each tag must be a leaf node in a tree

Why the rules? It provides some level of control for the placement and generality of the tags. The first one ensures there’s some separation/disjointness among the tags. And for the second – I was afraid of mixing general and very specific tags together in a grouping because it could hurt my learner’s accuracy (since the tags/labels are not on the same par). By forcing tags to be leaf nodes in the trees we sort of normalize the labels to be on the same weighted level of specificity.

Another note – when generating the tag binary classifiers, I originally proposed just taking all the files/features that map to a label grouping that contains that tag (set as the y=1 cases in the binary classifier’s training data model) and all the files/features that map to a grouping that does not contain the tag for the y=0 cases. However, this splitting up of the data seems likely to produce many bad/unnecessary features since (1) there can be a LOT of 0 cases and (2) 0 case files/examples can deal with ANYTHING, inducing their completely irrelevant features to the tag’s binary classifier’s model. But we have a way out of this dilemma thanks to the tag normalization rules above – since we can only choose a single tag from each tree, we can use all the inputs/files/training data examples that map to other leaf-node tags in the SAME tree for the zero cases. This selection of 0 case files scopes the context down to one label hierarchy/tree that contains the tag we’re trying to learn.

Anyway, I’ll try to post the pseudo code (and actual code) for my algorithms and some initial experimental results on this blog shortly. Additionally, expect a tutorial describing the steps/software I used to perform these tests.



Leave a comment

Filed under AI, CS, Machine Learning, Research, Tagging, TODO

Are professors too paper-happy?

[Update (4/26/2006): Motivation]

I've received several emails/comments (mostly from researchers and professors) regarding this post and realized some may have misconstrued my intentions for writing this article. I don't blame them. The title sounds pretty controversial and this post is quite long – compelling readers to skim and consequently miss some of the points I'm trying to make. As I mention in the second paragraph, 'I'm a huge fan of publications. The more the better.' All I'm trying to do here is make the case for why I think researchers should ALSO use more informal avenues of communication, such as blogs, for getting their ideas out for public review/commenting. These sources would not only increase reach/access, but serve as great supplementary material to their corresponding conference papers – NOT as replacements or alternatives to publications. I received a very insightful idea from Hari Balakrishnan – What if authors, in conjunction with their papers, release informal 5-6 page summaries of their projects/ideas for online public review/commenting? I would love to have resources like these available to me when dissecting the insight/knowledge encapsulated in formalized papers. I think simple ideas like these would significantly improve reachability/access of our research and even encourage more creativity/questioning.

[End Update]

On Digg:

I recently read through a professor's CV, which under the 'Publications' section stated "… published over 100 conference papers." I then proceeded to the next page of the resume to scan through some of his paper titles, only to see a brand new section titled 'References'. Wait, where'd his publications go? I went back to see if I missed a page. Nope. Wow … why wouldn't he include his citations or mention notable pieces of work? I mean, saying you have 100+ publications gives me no value unless you're going to list them. Granted, that's an amazing accomplishment – writing a paper that advances (cs) research is not only time consuming and hard work, but it requires a whole new level of creativity. Additionally, conferences are getting super competitive (unless he's publishing in these conferences) – I hear many cases of famous, tenured professors/researchers getting their papers rejected. Now multiplying this paper process by 100 represents a significant amount of research, so I'm willing to bet this guy is pretty bright. However, this one line publication description gives me the impression that this professor aims for quantity in order to amplify his prestige. I also get this feeling after publishing the 100th paper he reached one of those hard set goals that gives one the sense of 'Mission Accomplished'. I guess I'm different – I'm all about quality. Personally, I'd be content with 2 publications in my life if one of 'em got me the Turing Award and the other a date with Heidi Klum – but that's just me 🙂

Interestingly (or oddly), this CV publication description also got me thinking about some of the things I hate about (cs) papers. But first, let me make it very clear – I am a huge fan of conference publications. The more the better. The peer review process of research papers is absolutely critical to advancing science. In this post, however, I would like to make the case for why I think all researchers should ALSO use more informal avenues of communication (such as blogs, wikipedia, etc.) for publicly broadcasting their ideas and results of their latest research.

So let's start this off with a laundry list of gripes I have about 'most' papers (with possible alternatives/solutions mixed in):

PDF/PS/Sealed/doc/dvi/tex formats

  • That load time for the viewer is killer – probably deters many from even clicking the link
  • Certain viewers render diagams/text differently
  • The document looks formal, intimidating, elitist, old, plain, hardcore, not happy to see you
  • Also makes documents feel permanent, finalized, not amenable to changes
  • Provides no interface for commenting by readers – and in many cases I find reader critiques more interesting/useful than the actual paper
  • And why can't we see the feedback/comments from the conference?
  • Not pretty – It's a lot of material to read, so why not make the presentation look happier? I seriously think a nice stylesheet/web layout for paper content would significantly improve readability and portability

It's a lot of work

  • Not only does one need to come up with a fairly original idea, but research, discuss, and analyze its empirical/theoretical implications
  • Needs support/citations
  • Papers are quite formulaic – I can easily guess the page that has the nice/pretty graphs
  • This structure imposes pretty big research burdens on the authors
  • This is a GOOD thing for paper quality
  • But a terrible method for prototyping (similar to how UI dev's quickly prototype designs to cycle in user feedback)
  • Doesn't provide professors/researchers a forum to quickly get ideas out nor to filter in comments
  • Also prevents researchers from spewing their thoughts out since they wish to formalize everything in papers
  • Now there are other journals/magazines which are informal to let researchers publish their brainstorming/wacky idea articles
  • But there's still a time delay in getting those articles published
  • They still have writing standards and an approval process
  • And I don't read them (Where do I find good informal CS articles online? Is anyone linking to these? Who's authoring them?)
  • In blogs, authors can be comedic, speak freely ("he's full of it", "that technique is a load of crap" – opinions like these embedded in technical articles would making reading so much more enjoyable), and quickly get to the point without having to exhaust the boring implementation details


  • Although papers normally include author emails, this means feedback is kept privately in the authors' inboxes, not viewable to the general public
  • Many papers/journals require subscriptions/fees


  • The main issue here is professors/researchers want to publish in prestigious papers/journals
  • Rather than waste their time with things like weblogs, which are perceived to be inherently biased and non-factual, and where the audience may seem 'dumber'
  • It's in their best interest to focus on publications – get tenure, fame, approval from experts in the field

I want informality

  • But it sucks for us ('us' being the general audience who may wish to learn about these breakthroughs)
  • I love hearing professors ramble off their ideas freely
  • I want to see commentary from readers and myself ON these articles
  • And I want to see these ideas get posted and updated quickly
  • I want to see these experts explaining their ideas in simple terms (whenever possible, if it's too hardcore then it's probably not very good bloggable material) and describe the real world impacts/examples
  • But unfortunately NONE of my favorite professors/researchers publish their ideas on blogs

Popularity and Reach

  • There isn't much linkability happening for papers (besides from class/professor web sites)
  • You don't see conference papers getting linked on Slashdot/Digg
  • If millions of geeks don't even want to link/read this stuff, how and why would others? (refer below to the concluding section for why I think this is important)
  • Papers are formal, old-school, and designed to be read by expert audiences
  • They are written to impress, be exhaustive/detailed, when in reality most of us are lazy, want to read something entertaining, and get to the dang point
  • Wouldn't it be nice if professors/researchers expressed their ideas/research in a bloggish manner whenever possible?
  • At best a blog post would be a great supplemental/introductorial reading to the actual paper
  • Even the conference panel can check the blog comments to see if they missed anything
  • Some professors express their ideas informally with lecture notes and powerpoint presentations
  • But again, these formats don't let others annotate it with their comments
  • They are mostly pdf/ps/ppt/doc's (ah that load time)
  • And lecture notes usually exist for core concepts, not experimental/recent ideas


  • Which brings me up another interesting idea
  • Where's THE place to learn about core, fundamental cs topics?
  • Or even slightly maturing cs algorithms/techniques?
  • I tend to follow this learning process:
    • Check Wikipedia
    • At best gives me a great introduction to a topic
    • If i need more, I use what I learned from wikipedia/other sources to build queries for searching lecture notes/presentations
    • Typically the interesting details are hidden in a set of lecture notes/papers/books
    • And which of these sources should I use? – I have to read many versions coming from different professors to piece together a clear story
  • Wouldn't it be nice to have a one stop shop
  • Wikipedia!
  • What if Universities MOTIVATED professors/researchers to collaborate on wikipedia to publish in-depth articles regarding their areas of interest?
  • This would be huge
  • Wikipedia is incredibly popular/useful/successful, so let's use it as the place where professors around the world can unite their knowledge
  • Would be the best supplement (even replacement) for lecture notes
  • And for more experimental/controversial topics, researchers can use individual blogs

A slight digression and concluding remarks: The Future of Computer Science Research

Two things:

  1. I strongly believe the future of (computer) science relies on making our stuff more accessible to others – requiring us to tap formal AND informal avenues of communication
  2. We need to be more interdisciplinary!

Many people (especially prospective computer science students) ask me:

What's left to invent?

  • Most research to them (and me) sounds like 'optimizations' based work
  • Is there anything revolutionary left to do? Has it all been accomplished?
  • This is a hard question to answer convincingly, especially since if there was something revolutionary left to do, someone (hopefully me) would be on it
  • It's also difficult to forsee the impacts of ongoing research
  • Big ideas/inventions are evolutionary, piecing together decades of work.

They even say … "if only I could go back in time and invent the TV, PC, Internet, Web Search …"

  • As if back in the day we were living in the stone age and there was so many things left to do
  • There are even more things left to do today!

Our goal should not be to come up with an idea that surpasses the importance of say the invention of the PC

  • (I'm not even sure if this is possible, and at best is an ill comparison to make)
  • But rather to learn about the problems people face NOW and use our knowledge of science to solve them
  • Research should be entrepreneurial: Find a customer pain and provide a method to solve it
  • Not the other way around: Playing with super geeky stuff for fun and hoping later it might have applications (there are exceptions to this, since it encourages depth/exploration of a field which may lead to an accidental discovery of something huge, but I still think for the most part this is the wrong way to go about research)
  • Your audience is the most important element of research
  • And the beauty of our field – IT is a common theme in every research area!
  • We're the common denominator!
  • We can go about improving research in any subject

Now getting back to focus …

  • We consider the PC, Internet, etc. to be revolutionary because they are intuitive, fundamental platforms
  • Current research adds many layers of complexity to these ideas in order to solve more difficult problems
  • Making things more and more complex
  • What we need to be doing better is making our complex systems easier to use
  • We need better integration
  • And expand our research applications into other fields, rather than just reinforcing them back into computer science areas
  • We need to be interdisciplinary!
  • We have yet to even touch the surface when it comes to exploring the overlaps of fields

Some of the things we've been brewing since the 1970's could seriously REVOLUTIONIZE other industries

  • Imagine machine learning a database filled with all past medical histories/records/diagnostic reports
    • So when presented with symptoms/readings of a new patient, the system will tell you what that patient is suffering from with high probability based off analyzing millions of records
    • This would dramatically decrease the number of cases of death due to bad diagnosis and lower medical/insurance costs since it wouldn't be necessary to run a bunch of expensive scans/tests and surgeries to figure out what's wrong with a patient
  • A similar system could do the same thing over economics, disease, environmental, hazards datasets so that scientists and policy makers can ask questions like 'In the next five years what region of the world has the highest probability of a natural disaster … or a disease epidemic?, etc.'
    • This would be huge in shaping policy priorities, saving human lives, and preparing/preventing disasters like Katrina from ever escalating
  • Or what about mobile ad hoc networks to let villages in the poorest regions of the world wirelessly share an internet connection
  • Or sensor networks to do environmental sensing, health monitoring in hospitals, traffic control, pinpointing the location of a crime, etc.
  • And plenty, plenty more

The attractive elements of computer science research is not necessarily the algorithms, nor the programming implementation

  • But rather the intelligence/knowledge it can provide to humans after partying on data
  • Information is what makes a killer application!
  • And every field has tons of interesting information
  • We need to INTERACT and learn more about the needs in other fields and shape our research accordingly
  • (Unfortunately, computer scientists have a somewhat bad reputation when it comes to any form of 'interaction')

So after hearing all this, the prospective science students complain:

  • "Now we have to learn about multiple fields – our research burden is so much bigger than the scientists' in the past"
  • This is a very silly argument for defending one's laziness to pursue (computer) science
  • I mean I much rather hear one of them 'I don't want to get outsourced to India' excuses 🙂
  • Here's why:
    • We got it GOOD
    • I mean, geez, we got Google!
    • I can't even begin to imagine what Einstein, Gauss, Turing, Newton, etc. could have accomplished if they had the information research tools we got
    • We could have flying cars, cleaner environment, robots doing our homework
    • Heck, they could have probably of found a way to achieve world peace
    • We take what we have for granted
  • Back in the day, where there was no email, no search, no unified, indexed repository of publications
  • Scientists had to do amazing amounts of research just to find the right sources to learn from
  • Travel around the world to meet with leading researchers
  • Read entire books to find relevant information
  • You see, scientists back then had no choice but to be one-dimensional, and those who went deep into different fields either had no lives or were incredibly brilliant (most likely both)
  • But we can do all this work they had to do just to prepare for learning in a matter of seconds, from anywhere!
  • It's utter crap to think we can't learn about multiple fields of interest, in-depth
  • And our research tools will only get better and better exponentially faster
  • But I don't blame these students for their complaint
  • We're still trained to be one-dimensional
  • Universities have specific majors and graduate programs (although some of these programs are SLOWLY mixing together)
  • Thereby requiring many high school students to select a single major on the admissions application
  • Even though their high school (which focuses on breadth) probably did a terrible job introducing them to liberal arts and science
  • [Side Note: Actually, this is one of things I never quite understood. How do students choose a major like EE/NuclearEng/MechEng on the admissions application, when their high school most likely doesn't offer any of those courses? Sounds like their decision came down to three factors (1) The student is a geek (2) Parent's influence (3) Money. 2/3 of these are terrible reasons to enter these fields. Even worse, many universities separate engineering from liberal arts, making it almost impossible for a student who originally declared 'Economics' to switch to BioE after taking/liking a biotech class – despite the fact that their motivation to enter the major is so much better. You should enter a field you love, and making the decision on which students can enter engineering majors based off high school, which gives you almost zero exposure to these fields, makes no sense. And we wonder why we have a shortage of good engineers … well one reason is probably because we don't let them into the university programs – 'them' being the ones who actually realized they liked this stuff when they got to college.]
  • Many job descriptions want students/applicants with skills pertaining to one area

But unfortunately the problems we now face require us to understand how things mix together

  • This is why it's very important we increase the levels communications between fields to promote interdisciplinary research
  • Like a simple example: Databases and IR
    • Both have the same goal: To answer queries accurately regarding a set of data
    • Despite this, they are seen as two traditionally separate areas
    • Database/IR people normally work on different floors/buildings at companies/universities
    • One system is based traditionally more on logic while the other is more statistical/probablistic
    • But exploiting their overlap could greatly improve information access (a la SQL Server 2005)
    • Notice this example refers to two subtopics under just computer science, not even two mainstream fields!
  • Just imagine the impacts of collaborating more with other fields of study

This is why I feel blogging, wikipedia, and informal avenues of communicating our thoughts/research/ideas is very important

  • So that people in other fields can read them
  • And for the laundry list of reasons above
  • Even if it's just a proof of concept technique with no empirical studies
  • An interested geek reader who came across the blog post might just end up coding it up for you
  • Could even inspire start-up companies

So the POINT: Do whatever you can to voice your ideas/research to the largest possible audience in the hopes of cross fertilizing research in many fields.

* So what's up with the parentheses around 'cs' and 'computer' in this post? Well, many of these points generalize to all (science) research publications 🙂

This work is licensed under a Creative Commons License



Filed under Blogging, CS, Education, Non-Technical-Read, Publications, Research, Science, Wikipedia

The Future of Tagging

Update (9/25): A Short Presentation On This (forgot to link)
Update (5/06): An Algorithmic Update
Update (3/29): Added a Motivation section


After reviewing the comments/emails I’ve received so far, I realized my article could make the motivation clearer. As I’ve mentioned in this post/comments, I agree with everyone regarding basic tagging – it’s by far the simplest way for any user to organize their media/text so that others can search/relate their content. I don’t see this ever going away. However, I decided to take a step back in this article and look at the issues with the current tagging model and examine an alternative method, namely hierarchical multi-labeling. Hierarchical multi-labeling solves many of the issues concerning basic tagging and should lend to better auto-tagging algorithms since it tells us how tags are related to each other. I definitely agree this isn’t something we should expect the average user to perform – but I do think power users and content aggregators like Google News could benefit greatly from this tagging model.

One of my goals this semester is to let anyone (most likely a hardcore geek) pass a tag network and training data (both represented as files) to my web service and I generate for them a classifier page (with a search box that takes in a webpage link or a string of words). Click ‘Classify’ and it’ll return the best set of tags for that input based off the training data examples. Services can use this to classify their news articles, emails, bookmarks, etc. Notice the grunt work (which isn’t too bad) is done by a geek, but all users can benefit from such a system.

Anyway, hope you enjoy the read and please comment if possible.

On Digg:,_and_how_to_fix_it

Two ways we can improve upon tagging:

  1. Support label hierarchies/groupings
  2. Use multiple (but ‘normalized’, explained below) labels per object

Why Hierarchical Tags?

Many app’s make use of a single level hierarchy of tags

  • Ex’s: Flickr, Gmail, Google Base, Technorati, WordPress, Delicious, YouTube
  • Big buzz around tagging
  • Finally provides users a simple way to structure their unstructured content (typically text, photos, video, and music media)
  • Makes it easier to search for related items

But one level hierarchy has issues

  • Namespace – redundant/similar named tags treated differently – wasteful
  • Not normalized – tags are not equal, general and specific in the same level
  • Loses relationships among different tags (how is tag x and tag y causally related?)
  • One level hierarchy labels are really just keywords
  • Not much different than a concise textual description
  • Only provides structural support if multiple documents use same tags
  • In it’s current form, it’s an unstructured way of structuring unstructured content
  • But better than nothing

Very simple Ex: Gmail has a one level hierarchy of tags

  • I get an email regarding my CS294 class, so I tag it with ‘CS294’
  • However, I would also like to be able to search through all emails
    relevant to school, so I have to tag it with label ‘Coursework’
  • But ‘Coursework’ encompasses ‘CS294’
  • I have to redundantly add Coursework to each CS294 tagged email even though ‘CS294’ implies ‘Coursework’
  • I could label it ‘Coursework\CS294’, but that’s just one tag specific to
    CS294, I can’t separate out the ‘Coursework’ part for search
  • This slashing technique, popularized in many Delicious sites, provides hierarchy in name only
    • Fails to capture any practical benefits like searching or relating different hierarchies

Why Multi-Label?

Objects typically (and should) fall under several categories

Many cool AI applications/data representations motivate multiple labels:

  • Medical Diagnosis
    • Real life Dr. House without the attitude
    • Normally many causes/diseases associated to a set of health features
    • Help diagnosticians to narrow down on the most likely set of causes
  • Computer Vision (i.e. Flickr, Riya,,
    • For ex. Espgame/Peekaboom collect many labels for images/pixels
    • Could use their data to train computer vision learners for auto-tagging
  • Email/Filesystems/Databases/Search (i.e. Gmail, WinFS, SQL, Google Base)
    • File/Directory concept outdated
    • ‘Files’ are really objects which have metadata and relational aspects
    • Multi-Labels present a great, simple way to structure the diverse unstructured content in a file
    • (side-note: hierarchical tags could be used to provide backwards compatibility with File/Directory)
  • News (i.e. Google News, Digg, CNN, NY Times, Slashdot,
    • Multiple (hierarchical) labels for each news piece
    • Like seeing these labels { News.Tech.Computers.Hardware; Ideology.Geeky.Anti-Microsoft; Source.Online.Blog; People.Bill-Gates }
      … tells me a ton about an article before even reading it
    • Plus I can now search/relate these tags to find similar news articles based off specific attributes

Let’s get a bit more technical

We organize labels into trees (gives us hierarchies)

Per object, we choose multiple labels if each label comes from a different tree (hence ‘normalized’, provides some degree of separation/independence of tags)

So, what’s the point of adding all this complexity to tagging

One of the nice benefits of tagging is it’s so simple

I agree: I’m not expecting mommy and daddy to do hierarchical multi-labeling

But content providers can do this to reap the benefits described above

AND, because it will help our artificial intelligence algorithms learn how to multi-tag objects automatically (mainly because we know how tags are related to each other)

A possible machine learning algorithm for hierarchical multi-labeling


We’ll build this algorithm based off binary supervised classifiers because:

  • Well understood in theory & practice; simpler, best accuracy
  • Many multiclass classifiers actually use several pairwise (all-pairs, one-versus-all, etc) binary classifiers
  • Many algorithms to work with: Perceptron, Kernels (Support Vector Machines), Neural Nets, Decision Trees, etc.

Want to create a Bayesian network based off the tag trees (actually it’s more like a Markov random field since there are undirected edges between tree nodes which occur together in the training data, annotated with CPT/SAT-based representations describing the causalities)

Ex. of a Tag Network


  • Sports
    • Editorial


  • Liberal
    • Democrat
    • Marxist
  • Geeky
    • Anti-Microsoft
  • Nationalism
    • Brazil
    • USA


  • Outdoor
    • Baseball
    • Football
  • Indoor
    • Soccer


  • Paper
    • Magazine
  • Online
    • Blog
      • Wikipedia

* Does not show (undirected) links between hierarchies (like
News.Sports to Ideology.Nationalism) since it’s hard to show in text

Ex. Training Data

x1=3.45, x2=2.10, x3=5.45, x4=0.20, x5=9.20

y =

  1. News.Sports.Editorial
  2. Ideology.Nationalism.Brazil
  3. Athletics.Indoor.Soccer
  4. Source.Online.Wikipedia

x1=1.25, x2=6.93, x3=3.11, x4=8.01, x5=0.20


  1. News.Tech.Computers.Hardware
  2. Ideology.Geeky.Anti-Microsoft
  3. Source.Online.Blog

How to fill in the CPT values for each bayes node in the Tag network?
We just count the tag groupings in the training data and use these numbers to generate a distribution


  • Create a Kernel (Support Vector) machine based binary classifier for each distinct tag
  • Train each binary classifier with the features from the training data whose y contains the tag (set classifier’s y = 1 for each of these feature sets)
    and with features that do not contain the tag (set classifier’s y = 0 for each)

    • (side-note: also known as one-versus-all approach, most common multiclass method)


We run a new feature set through the set of binary classifiers, which each output a 0 or 1

Now we could just use this bitstring to immediately return a set of tags (the tags associated to 1 bits), and several existing Multi-Label approaches do this, but I think we can do better

The two main issues I have using this bitstring directly:

  1. The binary classifiers treat the tags independent of one another
  2. We don’t know which features correlate to which tags, and to what degree
    • Therefore we may be using irrelevant features (since we use them all) for training the binary classifiers, which hurts accuracy

These issues introduce errors in our bitstring b

However, we can use the tag relationships in our bayesian network to correct b

This problem lends nicely to an information theoretical approach

  • We received b over a noisy channel
  • Use what we know about tag relationships to reduce error (i.e. Hamming distance from the actual bitstring)

Reducing Error

There are several ways to go about error correcting b, here’s the one I came up with:
(any feedback esp. here would be great)

  • Use a Gibb’s (MCMC) based sampling scheme to generate candidate bitstrings from b
  • Nice convergence properties
  • Not knowing which bits are wrong in b motivates randomized/sampling methods
  • For n times, randomly choose a bit and flip it proportional to its probability in its CPT, output the new bitstring
  • This scheme occasionally alternates from sampling new bitstrings based off previously ‘sample generated’ bitstrings and off the original b (could alternate after k iterations, where k is the median hamming distance score from the strings produced by the binary classifiers compared against the actual strings for a hidden training data set)

Now we wish to find ‘interesting’ bitstrings from our set of n

  • By ‘interesting’ I mean bitstrings with frequent itemsets (related)
  • Use Apriori data mining algorithm to find these bitstrings, call the returning set s
  • Then over the bitstrings in s, scan for bits that have the same assignments and for 1 bits

Run a diagnosis (or MAP) query over the tags assigned 1 conditioned on tags assigned the same value in every bitstring, which returns our desired tag assignments

Here’s an Ex.:

Say the bitstrings we get from the Apriori algorithm are:

0 1 0 1 0 1 0
0 1 1 1 0 1 0
0 1 0 1 0 0 1

I scan the bits and see A is 0, B is 1, E is 0, in all the bitstrings

I also see C, F, G were assigned 1 in at least one bitstring

So I run this MAP query over my bayes network to find the assignment that maximizes:

Pr(C=?, F=?, G=? | A=0, B=1, E=0)

and return the tags B (since it was assigned 1 everywhere) and whichever ones will be assigned 1 by this query

Bayesian Inference and Diagnosis

Our previous step has left us with a very difficult problem to solve

  • Just doing plain ol’ bayesian inference queries is #P complete
  • But MAP queries are even harder since they need to infer the probabilities of all possible assignments

But luckily for me, I wanted to reduce a fun problem down to a MAP query

  • Let’s me relate this project to some very interesting class material 🙂

Prof. Satish Kumar in CS294 lectured on an exciting method for quickly computing inference/MAP queries

Utilizes several interesting tricks:

  • Precompiling the bayesian network into an efficient SAT-based representation
  • AND’ing each DNF clause with the query terms
  • And then counting the # of solutions to each clause using Karp’s FPRAS algorithm
  • Adding up the counters (each multiplied by some factor) solves your inference queries!

Can do DNF sampling (Las-Vegas or Atlantic-City style) under the same scheme for computing diagnosis queries

(references to learn more: starting on pg. 179)

So, if we just precompile our tag network (before ever learning/predicting labels) then at runtime we can answer MAP queries in polynomial time w.r.t. the size of the SAT-based representation & exponential w.r.t the size of the largest communication link in our clique tree (an optimization that basically caches variables shared
between family nodes)

Pretty cool!

(A nice side-note property: the more variable assignments we condition on, the more we reduce the size of our SAT-based representation)


Everything is Parallelizable!

  • Counting the tag occurences in the training data to populate the CPT’s in the bayes network (split by record)
  • Generating the set of binary classifiers and running a new feature set over the binary classifiers (since they are independent)
  • Induced correction sampling of bitstrings
  • Apriori is parallelizable
  • Scanning the bitstrings returned from Apriori for commonalities (split by bitstring, for each tag just output 1/0, then reduce the sum and check it with the total # of bitstrings)
  • Even Satish’s inference/diagnosis algorithms (split by DNF clause)

Decouples tag relations and learning

  • Our tag network does not condition on specific feature values
  • This could be seen as a bad thing but learning the causalities of specific feature values to a label (or group of labels) sounds very difficult/messy & at best application limiting since it assumes we:
    1. Know all (or even a subset of) the possible values each of the features can take
    2. Have a learning method that maps each of these individual values to a tags distribution
  • However, binary classifiers implicity find discrepencies in particular features in order to differentiate data inputs
  • In our model, the user just needs to have tag-specific binary classifiers
  • They don’t need to incorporate causality with other tags in their one-tag specific learner
    • Avoiding MultiClass learners (which aren’t nearly as well-understood nor as accurate as binary classifiers)
    • Additionally, our paradigm lets users plug-in their favorite binary classifier (many to choose from, as mentioned in Motivations)
  • Our model lets users simply construct tag dependency trees without having to explicity show why those links exist

By making the problem more complex we might actually be improving learning accuracy

  • Tag networks, hierarchal labeling, & multi-labeling – added complex functions to the system – should improve learning
  • Let’s us now exploit the overlap of tags
  • Multi-Labeling increases the chances that our learner can predict one, a subset, or all the tags correctly for a new input
    • May be a better pay off than the currently popular paradigm of where you either get the label right or you don’t


This work is licensed under a Creative Commons License



Filed under AI, Databases, Google, Machine Learning, Research, Tagging

Want to help the AI community? Play games!

Last week I heard Luis von Ahn‘s talk on CMU’s Human Computation project, which utilizes Human cycles to solve significant AI-Complete problems like Computer Vision and NLP. How do they enlist the help of free human labor? By wrapping these problems into addictive, interactive, web based games. Genius. For example, their ESP Game gives two remote players a series of identical images to label. For each label they both match, the higher their scores.

So, how does this help advance AI research? Well, despite our major breakthroughs in computer hardware and theory, we haven’t quite figured out a robust method to understanding the contents of a picture. This is a big problem because most pictures on the web don’t come with sufficient metadata and often have insensitive file names like ‘3.jpg’. However, unlike machines, us humans can easily categorize a set of images. These labels would significantly help search engines like Google improve their image rankings and even enable users to search for related images.

Other interesting remarks from the talk – In just the past year humans spent over 9 billion hours playing Windows Solitaire. 9 billion! ASDFLKQOW!! To put this into perspective, it took 5 million hours to construct the Empire State building and 20 million hours for the Panama Canal. That’s ridiculous. Luis estimates that if just 5,000 humans played ESP for a 2 month’s time, all the images in Google search engine could be accurately labeled.

So, the moral of this post – if you’re bored and got a few human cycles to spare, try playing these ‘productive’ games.

1 Comment

Filed under AI, Games, Google, Research