Note: Refers to the ideas described in the original post
An Algorithmic Update
Just wanted to let people know that I’ve changed my algorithms/framework for hierarchical multi-label classification quite a bit. One thing that really bugged me about my initial idea was the error correction scheme, i.e. sampling the tag network (a Bayesian/MRF hybrid) for closely related bitstrings. All the SAT/conditional probability table values in this network are generated from the number of times tags occur together in the training data, which makes my error correction scheme a popularity contest. But what about the feature values? We SHOULD take these values into account and try to reduce our new input down to a training data example with closely related feature values THAT also happens to have a similar tag bitstring (based on the prediction string output by the binary classifiers).
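To make the idea concrete, here is a minimal sketch of what "closest training example by features AND tag bitstring" could look like. The scoring rule is my assumption for illustration only: a weighted sum of Euclidean feature distance and normalized Hamming distance, with a hypothetical `alpha` knob and a hypothetical function name `closest_example`. It also assumes the features are already scaled to comparable ranges.

```python
import math


def hamming(a, b):
    """Number of positions where two equal-length bitstrings differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_example(features, predicted_bits, training_set, alpha=0.5):
    """Return (index, score) of the best-matching training example.

    training_set is a list of (feature_vector, tag_bitstring) pairs.
    alpha (an assumed parameter) trades off feature distance against
    tag-bitstring distance; lower scores are better.
    """
    if not training_set:
        raise ValueError("training_set must not be empty")
    best_index, best_score = None, float("inf")
    for i, (train_features, train_bits) in enumerate(training_set):
        feature_d = math.dist(features, train_features)
        # Normalize so the tag term stays in [0, 1] regardless of tag count.
        tag_d = hamming(predicted_bits, train_bits) / len(predicted_bits)
        score = alpha * feature_d + (1 - alpha) * tag_d
        if score < best_score:
            best_index, best_score = i, score
    return best_index, best_score
```

With `alpha=0` this falls back to matching on tags alone, and with `alpha=1` it ignores the classifiers’ prediction string entirely, so the interesting settings are somewhere in between.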
As for assuming there are k errors in the bitstring (call it b) we get back from the classifiers: previously, we sampled new candidate bitstrings based on the bit pattern produced after randomly unsetting k bits in b. Instead, since many classifiers (like the support vector machine I’m using) can return a probability/confidence associated with the 0/1 output, my new algorithm chooses the k positions to unset not uniformly at random, but with a bias towards the bits with the smallest probabilities (since they are the most likely to be erroneous according to the classifiers).
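A rough sketch of the biased selection, under a few assumptions of mine: each bit comes with a confidence in [0, 1] for its predicted value, a position’s weight is `1 - confidence`, and the k positions are drawn one at a time without replacement. The function name and the small `eps` term (which keeps fully confident bits selectable) are hypothetical.

```python
import random


def choose_positions_to_unset(confidences, k, rng=None, eps=1e-9):
    """Pick k distinct bit positions, favoring low-confidence bits.

    confidences[i] is the classifier's confidence in bit i's predicted
    value. Weights are (1 - confidence) + eps, an assumed weighting.
    """
    if not 0 <= k <= len(confidences):
        raise ValueError("k must be between 0 and the number of bits")
    rng = rng or random.Random()
    weights = [(1.0 - c) + eps for c in confidences]
    remaining = list(range(len(confidences)))
    chosen = []
    for _ in range(k):
        # Weighted draw among the positions not yet chosen.
        pick = rng.choices(
            range(len(remaining)),
            weights=[weights[i] for i in remaining],
            k=1,
        )[0]
        chosen.append(remaining.pop(pick))
    return sorted(chosen)
```

The returned positions are the ones to unset in b before sampling candidate bitstrings. Any other weighting that decreases with confidence would fit the same description.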
I also added two tag normalization rules for determining how to choose labels:
- No more than one tag from each tree/hierarchy
- Each tag must be a leaf node in a tree
Why the rules? They provide some level of control over the placement and generality of the tags. The first one ensures there’s some separation/disjointness among the tags. As for the second, I was afraid of mixing general and very specific tags together in a grouping because it could hurt my learner’s accuracy (since the tags/labels are not on a par with each other). By forcing tags to be leaf nodes in the trees, we sort of normalize the labels to be at the same level of specificity.
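The two rules are easy to check mechanically. Here is a small sketch, assuming the hierarchies are stored as a child-to-parent map (roots map to `None`); that representation, the function names, and the example tags in the usage note are all made up for illustration.

```python
def root_of(tag, parent):
    """Walk up the child -> parent map until we reach the tree's root."""
    while parent[tag] is not None:
        tag = parent[tag]
    return tag


def rule_violations(tags, parent):
    """Return human-readable violations of the two normalization rules."""
    # Any tag that appears as somebody's parent is an internal node.
    internal = {p for p in parent.values() if p is not None}
    problems = []
    first_tag_in_tree = {}
    for tag in tags:
        if tag in internal:
            problems.append(f"'{tag}' is not a leaf node")
        root = root_of(tag, parent)
        if root in first_tag_in_tree:
            other = first_tag_in_tree[root]
            problems.append(f"'{tag}' and '{other}' share the tree '{root}'")
        else:
            first_tag_in_tree[root] = tag
    return problems
```

For example, with a hypothetical map like `{"animal": None, "dog": "animal", "cat": "animal", "vehicle": None, "car": "vehicle"}`, the grouping `["dog", "car"]` passes both rules, while `["dog", "cat"]` breaks the first and `["animal", "car"]` breaks the second.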
Another note: when generating the binary classifiers for the tags, I originally proposed taking all the files/features that map to a label grouping that contains the tag as the y=1 cases in the binary classifier’s training data, and all the files/features that map to a grouping that does not contain the tag as the y=0 cases. However, this way of splitting up the data seems likely to produce many bad/unnecessary features since (1) there can be a LOT of 0 cases and (2) 0-case files/examples can deal with ANYTHING, introducing their completely irrelevant features into the tag’s binary classifier model. But we have a way out of this dilemma thanks to the tag normalization rules above: since we can only choose a single tag from each tree, we can use all the inputs/files/training data examples that map to other leaf-node tags in the SAME tree for the zero cases. This selection of 0-case files scopes the context down to the one label hierarchy/tree that contains the tag we’re trying to learn.
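Here is a quick sketch of that same-tree split. It assumes each training example is a `(features, tags)` pair and that the hierarchies are stored as a child-to-parent map with roots mapping to `None`; both the data layout and the function names are my own placeholders.

```python
def root_of(tag, parent):
    """Walk up the child -> parent map until we reach the tree's root."""
    while parent[tag] is not None:
        tag = parent[tag]
    return tag


def training_split(tag, examples, parent):
    """Split examples into y=1 and y=0 cases for one tag's classifier.

    y=1: the example's grouping contains the tag.
    y=0: the grouping contains a different tag from the SAME tree.
    Examples with no tag from that tree are left out entirely.
    """
    tree = root_of(tag, parent)
    positives, negatives = [], []
    for features, tags in examples:
        if tag in tags:
            positives.append(features)
        elif any(root_of(t, parent) == tree for t in tags):
            negatives.append(features)
    return positives, negatives
```

Compared with the original "everything else is a 0 case" split, the negatives here are limited to examples that were labeled within the same hierarchy, which is what keeps unrelated features out of the classifier’s training data.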
Anyway, I’ll try to post the pseudocode (and actual code) for my algorithms and some initial experimental results on this blog shortly. Additionally, expect a tutorial describing the steps/software I used to perform these tests.