This past summer I worked at MSR alongside Dr. Jim Gray, analyzing the web and SQL logs of the Skyserver (the online worldwide telescope portal). We just published our findings, which you can access here (MSR) or here (updated).
Still needs some clean-up (spelling, grammar, flow) and additional sections to tie up some loose ends, but it’s definitely presentable. Would love to hear what you guys think about the results (besides how pretty the graphs look :).
I find the distance function for the SQL templates very interesting. I don’t have access to the ACM paper, ref [16]. Could you send me more details on the N-gramming you did to get these results? How many tokens did you consider?
I just found your .PPT below with many more details. Would still like to get more references on how to do unstructured (or less structured) distance measures.
Thanks for the comments. This work was primarily inspired by Mark Manasse’s paper on syntactic clustering of web documents using a Jaccard metric. The references we used can be found on the references slide in my .PPT (right before the Q/A slide). If you’re interested in more unstructured distance functions, I would recommend looking for any work discussing cosine metrics (where you compute the dot product of the word sequence frequency vectors) and Jaccard-based metrics; those are the two primary flavors for computing text distance. Also, note that in our work we used the concept of a clustroid, where an actual document's feature vector represents the center of a cluster, as opposed to using the midpoint of a bunch of numerical vectors, which we don’t/can’t do since our feature vectors are composed of strings, not counts. Furthermore, the multi-set Jaccard metric we use is relatively new, I think: we append a rolling integer to repeated strings, since traditional sets discard duplicates and thereby ignore term frequency, which is typically important for term weights. There has also been recent work by Mark on efficient ways to factor term weights into the Jaccard metric (his paper on consistent weighting, which you can find on his MSR homepage). Hope this is helpful. Feel free to send me any specific questions.
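To make the three ideas above a bit more concrete, here is a rough Python sketch. It is only an illustration under stated assumptions, not the code used for the paper: the function names, the "#" separator for the rolling integer, and the use of plain word tokens as features are all hypothetical choices made for the example.

```python
from collections import Counter
from math import sqrt


def tag_duplicates(tokens):
    """Turn a token list into a set without losing repeats.

    Each occurrence gets a rolling integer appended ("a#1", "a#2", ...),
    so a plain set no longer collapses duplicates. The "#" separator is
    an assumption made for this sketch.
    """
    seen = Counter()
    tagged = set()
    for tok in tokens:
        seen[tok] += 1
        tagged.add(f"{tok}#{seen[tok]}")
    return tagged


def multiset_jaccard(a_tokens, b_tokens):
    """Jaccard similarity on the duplicate-tagged sets."""
    a, b = tag_duplicates(a_tokens), tag_duplicates(b_tokens)
    if not a and not b:
        return 1.0  # convention: two empty documents are identical
    return len(a & b) / len(a | b)


def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between the two term-frequency vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def clustroid(docs):
    """Pick the member with the smallest total Jaccard distance to the rest.

    The center is an actual document, so no averaging of string
    features is needed.
    """
    return min(docs, key=lambda d: sum(1 - multiset_jaccard(d, o) for o in docs))


# Hypothetical token lists standing in for SQL templates.
q1 = ["select", "a", "a", "from", "t"]
q2 = ["select", "a", "from", "t"]
q3 = ["select", "b", "from", "u"]

print(multiset_jaccard(q1, q2))
print(cosine_similarity(q1, q2))
print(clustroid([q1, q2, q3]))
```

Note that a plain set-based Jaccard would treat q1 and q2 as identical, since the repeated "a" collapses; tagging the duplicates is what lets term frequency show up in the score.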