Near the end of July, I crawled a sample of ~10M tweets. On my way over from Open Hack Day NYC yesterday I finally got some time to do some preliminary analysis of this data. Several posts have analyzed Twitter’s traffic stats [TechCrunch] [Mashable] [zooie], so I thought I’d focus more on the content here.

### Duplication

By compressing the data and comparing the before and after sizes, one can get a pretty decent understanding of the duplication factor. To do this, I extracted just the raw text messages, sorted them, and then ran gzip over the sorted set.

Compression ratio

>>> 284023259 / 739273532 bytes

0.38419238171778614

Typically, for text compression, gzip-like programs can achieve around 50% without the sort (and sorting typically helps), and here we get 38%. A standard text corpus consists of much larger document sizes, so it’s interesting to see a similar or larger duplication factor for tweets.
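A minimal sketch of this measurement, assuming the messages are already loaded as a list of strings. The function name `duplication_ratio` and the toy sample are hypothetical, and this uses Python's `gzip` module rather than the command-line tool.

```python
import gzip

def duplication_ratio(messages):
    """Return compressed size / raw size for the sorted messages."""
    # Sorting puts identical and similar messages next to each other,
    # so the compressor can find the repeats within its window.
    raw = "\n".join(sorted(messages)).encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)

# Toy data: heavy duplication should give a ratio well below 1.
sample = ["good morning twitter"] * 200 + ["tweet number %d" % i for i in range(200)]
ratio = duplication_ratio(sample)
print("compression ratio: %.3f" % ratio)
```

A lower ratio means more redundancy in the text, which is the sense in which 0.38 is read as a duplication signal above.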

We can dive even deeper into this area by analyzing the term overlap statistics to measure near duplication, or messages that aren’t necessarily identical but are close enough.

To do this, I first cleaned the text (removed stopwords, stemmed terms, normalized case). Interestingly, after cleaning the text, the average number of tokens per message is just 6.28, or 2.5x the size of a standard web search query.

Then, I employed consistent term sampling to select N representatives for each cleaned message and coalesced the representatives together as a single key. By comparing the total number of unique keys to messages, one can infer the near duplication factor. Also, the higher the N, the higher the threshold is to match (so N >= 6, 6 being the average number of tokens per message, probably means that two messages that generate the same key are exact duplicates).

You’ll notice N >= 6 converges around 83-84%, implying that after cleaning the text, roughly 16% of the messages exactly match some other message. Additionally, when N = 2 (requiring 2 of 6 tokens, or 33% of the text on average, to match), about 44% of the messages collide with other messages in the corpus. At N = 2, matching often means the messages discuss the same general topic but aren’t near duplicates.

| N Term Samples | Unique Keys | Coverage |
|---|---|---|
| 8 | 8548695 | 0.8356 |
| 6 | 8512672 | 0.8321 |
| 5 | 8476590 | 0.8286 |
| 4 | 8366391 | 0.8177 |
| 3 | 8098400 | 0.7916 |
| 2 | 5716566 | 0.5588 |
| 1 | 1013783 | 0.0991 |
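The post does not show how the sampling was implemented. One common way to sample terms consistently is to rank each message's terms by a hash and keep the N lowest, so the sketch below assumes that approach. The function names are hypothetical, the input is assumed to be already-cleaned text, and "coverage" is read here as unique keys divided by total messages.

```python
import hashlib

def sample_key(tokens, n):
    """Pick up to n representative terms and join them into one key."""
    # Ranking by a hash of the term (not by position) means two messages
    # that share the same low-ranked terms choose the same representatives.
    ranked = sorted(set(tokens),
                    key=lambda t: hashlib.md5(t.encode("utf-8")).hexdigest())
    return " ".join(ranked[:n])

def coverage(cleaned_messages, n):
    """Unique keys / total messages; lower means more near duplication."""
    keys = {sample_key(m.split(), n) for m in cleaned_messages}
    return len(keys) / len(cleaned_messages)

toy = ["a b c", "c b a", "a b d"]
print(coverage(toy, 3))
```

With a small N, two messages only need to agree on their few lowest-ranked terms to collide, which is why the match threshold loosens as N shrinks.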

### URLs

URLs are present in ~18% of the tweets

Of those, ~65% of the URLs are unique

70K unique domains covering 2M URLs

Top Domains:

[‘bit.ly’, ‘tinyurl.com’, ‘twitpic.com’, ‘is.gd’, ‘myloc.me’, ‘ow.ly’, ‘ustre.am’, ‘cli.gs’, ‘tr.im’, ‘plurk.com’, ‘ff.im’, ‘tumblr.com’, ‘yfrog.com’, ‘140mafia.com’, ‘u.mavrev.com’, ‘twurl.nl’, ‘tweeterfollow.com’, ‘mypict.me’, ‘viagracan.com’, ‘vipfollowers.com’, ‘morefollowers.net’, ‘digg.com’, ‘tweeteradder.com’, ‘ping.fm’, ‘tiny.cc’, ‘followersnow.com’, ‘short.to’, ‘twit.ac’, ‘snipr.com’, ‘wefollow.com’, ‘tweet.sg’, ‘url4.eu’, ‘the-twitter-follow-train.info’, ‘fwix.com’, ‘budurl.com’, ‘su.pr’, ‘shar.es’, ‘tinychat.com’, ‘snipurl.com’, ‘loopt.us’, ‘migre.me’, ‘flic.kr’, ‘myspace.com’, ‘snurl.com’, ‘twitgoo.com’, ‘zshare.net’, ‘post.ly’, ‘bkite.com’, ‘yes.com’, ‘flickr.com’, ‘twitter.com’, ‘artistsforschapelle.com’, ‘140army.com’, ‘youtube.com’, ‘x.imeem.com’, ‘pic.gd’, ‘TwitterBackgrounds.com’, ‘raptr.com’, ‘twt.gs’, ‘twitthis.com’, ‘mobypicture.com’, ‘tobtr.com’, ‘ad.vu’, ‘sml.vg’, ‘rubyurl.com’, ‘tinylink.com’, ‘redirx.com’, ‘a2a.me’, ‘eCa.sh’, ‘vimeo.com’, ‘meadd.com’, ‘hotjobs.yahoo.com’, ‘doiop.com’, ‘myurl.in’, ‘urlpire.com’, ‘buzzup.com’, ‘freead.im’, ‘youradder.com’, ‘facebook.com’, ‘adf.ly’, ‘justin.tv’, ‘twitvid.com’, ‘adjix.com’, ‘twcauses.com’, ‘lkbk.nu’, ‘tlre.us’, ‘htxt.it’, ‘stickam.com’, ‘twubs.com’, ‘isy.gs’, ‘reverbnation.com’, ‘news.bbc.co.uk’, ‘sn.im’, ‘twibes.com’, ‘ustream.tv’, ‘trim.su’, ‘hashjobs.com’, ‘blogtv.com’, ‘jobs-cb.de’, ‘xsaimex.com’]
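A rough sketch of how a top-domains list like this could be tallied. The regular expression, the `www.` stripping, and the function name are assumptions, not the method used for the numbers above.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumption: a URL is any whitespace-delimited run starting with http(s)://
URL_RE = re.compile(r"https?://\S+")

def top_domains(tweets, k=10):
    """Return the k most frequent URL domains across the tweets."""
    counts = Counter()
    for text in tweets:
        # set() so the same URL repeated within one tweet counts once
        for url in set(URL_RE.findall(text)):
            host = urlparse(url).netloc
            if host.startswith("www."):
                host = host[4:]
            counts[host] += 1
    return [domain for domain, _ in counts.most_common(k)]

toy = ["check http://bit.ly/abc",
       "pic http://twitpic.com/x and http://bit.ly/def",
       "no link here"]
print(top_domains(toy, 2))
```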

### Retweets

~4% of messages are retweets

### Replied @Users

~1M total replied-to users in this data set

37% of tweets contain ‘@x’ terms

Most Popular Replied-to Users (almost all celebrities):

[‘@mileycyrus’, ‘@jonasbrothers’, ‘@ddlovato’, ‘@mitchelmusso’, ‘@donniewahlberg’, ‘@souljaboytellem’, ‘@tommcfly’, ‘@addthis’, ‘@officialtila’, ‘@johncmayer’, ‘@shanedawson’, ‘@bowwow614’, ‘@jordanknight’, ‘@ryanseacrest’, ‘@perezhilton’, ‘@jonathanrknight’, ‘@petewentz’, ‘@tweetmeme’, ‘@adamlambert’, ‘@david_henrie’, ‘@dealsplus’, ‘@dwighthoward’, ‘@iamdiddy’, ‘@lancearmstrong’, ‘@songzyuuup’, ‘@imeem’, ‘@blakeshelton’, ‘@dannymcfly’, ‘@lilduval’, ‘@selenagomez’, ‘@markhoppus’, ‘@yelyahwilliams’, ‘@therealpickler’, ‘@stephenfry’, ‘@mrtweet.’, ‘@taylorswift13’, ‘@michaelsarver1’, ‘@davidarchie’, ‘@the_real_shaq’, ‘@tyrese4real’, ‘@britneyspears’, ‘@106andpark’, ‘@ashleytisdale’, ‘@mariahcarey’, ‘@kimkardashian’, ‘@wale’, ‘@mashable’, ‘@programapanico’, ‘@therealjordin’, ‘@listensto’, ‘@misskeribaby’, ‘@alyssa_milano’, ‘@alexalltimelow’, ‘@aplusk’, ‘@thisisdavina’, ‘@breakingnews:’, ‘@peterfacinelli’, ‘@truebloodhbo’, ‘@mgiraudofficial’, ‘@tonyspallelli’, ‘@mtv’, ‘@jackalltimelow’, ‘@dfizzy’, ‘@youngq’, ‘@tomfelton’, ‘@pooch_dog’, ‘@jonaskevin’, ‘@princesammie’, ‘@nkotb’, ‘@christianpior’, ‘@cthagod’, ‘@johnlloydtaylor’, ‘@neilhimself’, ‘@moontweet’, ‘@katyperry’, ‘@danilogentili’, ‘@mchammer’, ‘@rainnwilson’, ‘@joeymcintyre’, ‘@30secondstomars’, ‘@phillyd’, ‘@heidimontag’, ‘@mrpeterandre’, ‘@andyclemmensen’, ‘@crystalchappell’, ‘@kevindurant35’, ‘@huckluciano’, ‘@dannygokey’, ‘@jaketaustin’, ‘@revrunwisdom’, ‘@jamesmoran’, ‘@musewire’, ‘@dannywood’, ‘@nickiminaj’, ‘@akgovsarahpalin’, ‘@terrencej106’, ‘@mashable:’, ‘@drewryanscott’, ‘@mrtweet’, ‘@necolebitchie’, ‘@lilduval:’, ‘@willie_day26’, ‘@kirstiealley’, ‘@betthegame’, ‘@radiomsn’, ‘@alancarr’, ‘@rafinhabastos’, ‘@krisallen4real’, ‘@iamjericho’, ‘@breakingnews’, ‘@babygirlparis’, ‘@ladygaga’, ‘@chris_daughtry’, ‘@hypem’, ‘@danecook’, ‘@imcudi’, ‘@jeepersmedia’, ‘@buckhollywood’, ‘@kimmyt22’, ‘@giulianarancic’, ‘@chrisbrogan’, ‘@nasa’, ‘@addtoany’, ‘@nickcarter’, ‘@debbiefletcher’, ‘@marcoluque’, 
‘@shaundiviney’, ‘@ogochocinco’, ‘@twitter’, ‘@eddieizzard’, ‘@youngbillymays’, ‘@real_ron_artest’, ‘@pink’, ‘@laurenconrad’, ‘@rubarrichello’, ‘@ianjamespoulter’, ‘@liltwist’, ‘@teyanataylor’, ‘@dougiemcfly’, ‘@theellenshow’, ‘@robkardashian’, ‘@sherrieshepherd’, ‘@justinbieber’, ‘@paulaabdul’, ‘@jason_manford’, ‘@jaredleto’, ‘@tracecyrus’, ‘@itsonalexa’, ‘@ddlovato:’, ‘@khloekardashian’, ‘@revrunwisdom:’, ‘@solangeknowles’, ‘@allison4realzzz’, ‘@nickjonas’, ‘@reply’, ‘@anarbor’, ‘@donlemoncnn’, ‘@gfalcone601’, ‘@moonfrye’, ‘@symphnysldr’, ‘@iamspectacular’, ‘@honorsociety’, ‘@questlove’, ‘@guykawasaki’, ‘@dawnrichard’, ‘@_maxwell_’, ‘@somaya_reece’, ‘@mandyyjirouxx’, ‘@teemwilliams’, ‘@greggarbo’, ‘@pennjillette’, ‘@mikeyway’, ‘@matthardybrand’, ‘@iamjonwalker’, ‘@andyroddick’, ‘@kohnt01’, ‘@chris_gorham’, ‘@seankingston’, ‘@joshgroban’, ‘@mousebudden’, ‘@misskatieprice’, ‘@spencerpratt’, ‘@wilw’, ‘@jgshock’, ‘@swear_bot’, ‘@joelmadden’, ‘@techcrunch’, ‘@americanwomannn’, ‘@kelly__rowland’, ‘@mionzera’, ‘@astro_127’, ‘@_@’, ‘@spam’, ‘@sookiebontemps’, ‘@drakkardnoir’, ‘@noh8campaign’, ‘@kayako’, ‘@trvsbrkr’, ‘@qbkilla’, ‘@mw55’, ‘@guykawasaki:’, ‘@donttrythis’, ‘@cv31’, ‘@liljjdagreat’, ‘@tiamowry’, ‘@nickensimontwit’, ‘@holdemtalkradio’, ‘@bradiewebbstack’, ‘@nytimes’, ‘@riskybizness23’, ‘@radityadika’, ‘@adrienne_bailon’, ‘@riccklopes’, ‘@jessicasimpson’, ‘@sportsnation’, ‘@jasonbradbury’, ‘@huffingtonpost’, ‘@oceanup’, ‘@gilbirmingham’, ‘@iconic88’, ‘@the’, ‘@thebrandicyrus’, ‘@gordela’, ‘@thedebbyryan’, ‘@jessemccartney’, ‘@?’, ‘@caiquenogueira’, ‘@celsoportiolli’, ‘@shontelle_layne’, ‘@calvinharris’, ‘@chattyman’, ‘@ali_sweeney’, ‘@anamariecox’, ‘@joshthomas87’, ‘@emilyosment’, ‘@nasa:’, ‘@sevinnyne6126’, ‘@thebiggerlights’, ‘@theboygeorge’, ‘@jbarsodmg’, ‘@goldenorckus’, ‘@warrenwhitlock’, ‘@bobbyedner’, ‘@myfabolouslife’, ‘@descargaoficial’, ‘@ochonflcinco85’, ‘@ninabrown’, ‘@billycurrington’, ‘@oprah’, ‘@junior_lima’, ‘@asherroth’, ‘@starbucks’, 
‘@jason_pollock’, ‘@intanalwi’, ‘@harrislacewell’, ‘@serenajwilliams’, ‘@kevinruddpm’, ‘@bigbrotherhoh’, ‘@oliviamunn’, ‘@chamillionaire’, ‘@tamekaraymond’, ‘@teamwinnipeg’, ‘@littlefletcher’, ‘@piercethemind’, ‘@brookandthecity’, ‘@iranbaan:’, ‘@tonyrobbins’, ‘@maestro’, ‘@glennbeck’, ‘@1omarion’, ‘@nadhiyamali’, ‘@slimthugga’, ‘@jason_mraz’, ‘@profbrendi’, ‘@djaaries’, ‘@juanestwiter’, ‘@davegorman’, ‘@zackalltimelow’, ‘@mamajonas’, ‘@itschristablack’, ‘@skydiver’, ‘@gigva’, ‘@currensy_spitta’, ‘@paulwallbaby’, ‘@rpattzproject’, ‘@petewentz:’, ‘@rodrigovesgo’, ‘@drdrew’, ‘@sportsguy33’, ‘@cthagod:’, ‘@hollymadison123’, ‘@mjjnews’, ‘@itsbignicholas’, ‘@_supernatural_’, ‘@santoevandro’, ‘@demar_derozan’, ‘@marthastewart’, ‘@billganz62’, ‘@oodle’, ‘@davidleibrandt’]
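Entries such as ‘@mashable:’ and ‘@mashable’ appearing separately suggest the terms were split on whitespace without trimming punctuation. A small sketch of counting ‘@x’ terms that also merges those variants follows; the trimming step and the function name are additions for illustration, not part of the original analysis.

```python
from collections import Counter

def mention_counts(tweets):
    """Count lowercase '@x' terms, trimming trailing punctuation."""
    counts = Counter()
    for text in tweets:
        for tok in text.lower().split():
            if not tok.startswith("@"):
                continue
            # Merge variants like '@mashable:' into '@mashable'
            tok = tok.rstrip(":.,!?")
            if len(tok) > 1:
                counts[tok] += 1
    return counts

toy = ["RT @Mashable: hi", "@mashable thanks", "email me at a@b.com"]
print(mention_counts(toy).most_common(1))
```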

### Hashtags

~7% of messages contain hashtags

Total Unique Hashtags found: ~94k

Top Hashtags:

### Questions

Hard to infer exactly whether a message is a question or not, so I ran a couple of different filters:

5W’s, H, ? present ANYWHERE in tweet:

0.102789281948 or 10%

5W’s, H first token or ? last token:

0.0238229662219 or 2%

Just ? ANYWHERE in tweet:

0.0040984928533 or 0.4%
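The three filters could be sketched roughly as below. The exact tokenization is not given in the post, so the whitespace split, the punctuation trimming, and the function names are assumptions, and "? last token" is approximated as the tweet ending in a question mark.

```python
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how"}

def _words(text):
    # Lowercase, split on whitespace, trim punctuation around each token.
    return [t.strip("?!.,:;\"'") for t in text.lower().split()]

def has_question_word_or_mark(text):
    """Loosest filter: a 5 W's / H word or a '?' anywhere in the tweet."""
    return "?" in text or any(w in QUESTION_WORDS for w in _words(text))

def starts_or_ends_like_question(text):
    """Stricter filter: 5 W's / H as first token, or '?' closing the tweet."""
    words = _words(text)
    starts = bool(words) and words[0] in QUESTION_WORDS
    return starts or text.rstrip().endswith("?")

def has_question_mark(text):
    """Only a literal '?' anywhere in the tweet."""
    return "?" in text

def fraction(tweets, predicate):
    return sum(1 for t in tweets if predicate(t)) / len(tweets)

toy = ["What time is it", "I know what you did", "are you there?", "lunch"]
print(fraction(toy, has_question_word_or_mark))
```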

### Users

Discovered ~2M unique users

Top Sending Users (many bots):

[‘followermonitor’, ‘Tweet_Words’, ‘currentcet’, ‘currentutc’, ‘whattimeisitnow’, ‘ItIsNow’, ‘ThinkingStiff’, ‘otvrecorder’, ‘delicious50’, ‘Porngus’, ‘craigslistjobs’, ‘GorPen’, ‘hashjobs’, ‘TransAlchemy2’, ‘bot_theta’, ‘CHRISVOSS’, ‘bot_iota’, ‘bot_kappa’, ‘TIPAS’, ‘VeolaJBanner’, ‘StacyDWatson’, ‘LMAObot’, ‘SarahJSlonecker’, ‘AllisonMRussell’, ‘bot_eta’, ‘SandraHOakley’, ‘bot_psi’, ‘bot_tau’, ‘LoreleiRMercer’, ‘bot_zeta’, ‘bot_gamma’, ‘bot_sigma’, ‘bot_lambda’, ‘bot_pi’, ‘bot_epsilon’, ‘bot_nu’, ‘bot_rho’, ‘bot_omicron’, ‘bot_khi’, ‘LindaTYoung’, ‘mensrightsindia’, ‘bot_omega’, ‘bot_ksi’, ‘bot_delta’, ‘bot_alpha’, ‘bot_phi’, ‘CindaDJenkins’, ‘bot_mu’, ‘ImogeneDPetit’, ‘bot_upsilon’, ‘OPENLIST_CA’, ‘openlist’, ‘isygs’, ‘dq_jumon’, ‘gamingscoop’, ‘MildredSLogan’, ‘ObiWanKenobi_’, ‘pulseSearch’, ‘MaryEVo’, ‘ImeldaGMcward’, ‘MaryJNewman’, ‘SharonTForde’, ‘LoriJCornelius’, ‘BrandyWPulliam’, ‘RhondaTLopez’, ‘AprilKOropeza’, ‘CarolETrotman’, ‘SusanATouvell’, ‘dinoperna’, ‘buzzurls’, ‘_Freelance_’, ‘DrSnooty’, ‘illstreet’, ‘bibliotaph_eyes’, ‘loc4lhost’, ‘bsiyo’, ‘BOTHOUSE’, ‘post_ads’, ‘qazkm’, ‘frugaldonkey’, ‘free_post’, ‘groovera’, ‘wonkawonkawonka’, ‘ForksGirlBella’, ‘casinopokera’, ‘dermdirectoryny’, ‘Yoowalk_chat’, ‘mstehr’, ‘hashgoogle’, ‘perry1949’, ‘ensiz_news’, ‘Bezplatno_net’, ‘timesmirror’, ‘work_freelance’, ‘cockbot’, ‘pdurham’, ‘bombtter_raw’, ‘ocha1’, ‘AlairAneko24’, ‘HaiIAmDelicious’, ‘Freshestjobs’, ‘fast_followers’, ‘LeadsForFree’, ‘RideOfYourLife’, ‘AlastairBotan30’, ‘helpmefast25’, ‘TheMLMWizard’, ‘uitrukken’, ‘adoptedALICE’, ‘TKATI’, ‘ezadsncash’, ‘tweetshelp’, ‘LAmetro_traffic’, ‘thinkpozzitive’, ‘StarrNeishaa’, ‘AldenCho36’, ‘JobHits’, ‘wootboot’, ‘smacula’, ‘faithclubdotnet’, ‘DmitriyVoronov’, ‘brownthumbgirl’, ‘NYCjobfeed’, ‘hfradiospacewx’, ‘FakeeKristenn’, ‘MLBDAILYTIMES’, ‘wildingp’, ‘JacksonsReview’, ‘EarthTimesPR’, ‘friedretweet’, ‘Wealthy23’, ‘RokpoolFM’, ‘HDOLLAZ’, ‘_MrSpacely’, ‘Bestdocnyc’, ‘Rabidgun’, ‘flygatwick’, ‘live_china’, 
‘friendlinks’, ‘retweetinator’, ‘iamamro’, ‘thayferreira’, ‘AldisDai39’, ‘AndersHana60’, ‘nonstopNEWS’, ‘VivaLaCash’, ‘TravelNewsFeeds’, ‘vuelosplus’, ‘threeporcupines’, ‘DemiAuzziefan’, ‘worldofprint’, ‘KevinEdwardsJr’, ‘REDDITSPAMMOR’, ‘NatValentine’, ‘ChanelLebrun’, ‘nowbot’, ‘hollyswansonUK’, ‘youngrhome’, ‘M_Abricot’, ‘thefakemandyv’, ‘scrapbookingpas’, ‘Naughtytimes’, ‘Opcode1300_bot’, ‘tellsecret’, ‘tboogie937’, ‘Climber_IT’, ‘comlist’, ‘with_a_smile’, ‘USN_retired’, ‘Climber_EngJobs’, ‘Climber_Finance’, ‘Climber_HRJobs’, ‘intanalwi’, ‘Climber_Sales’, ‘nadhiyamali’, ‘wonderfulquotes’, ‘MRAustria’, ‘O2Q’, ‘GL0’, ‘SookieBonTemps’, ‘MRSchweiz’, ‘latinasabor’, ‘nineleal’, ‘casservice’, ‘AltonGin54’, ‘KulerFeed’, ‘_cesaum’, ‘HFMONAIR’, ‘DeeOnDreeYah’, ‘rockstalgica’, ‘iamword’, ‘rpattzproject’, ‘madblackcatcom’, ‘ftfradio’, ‘marciomtc’, ‘SocialNetCircus’, ‘AnotherYearOver’, ‘ichig’, ‘tcikcik’, ‘HelenaMarie210’, ‘mrbax0’, ‘SWBot’, ‘DayTrends’, ‘_Embry_Call_’, ‘eProducts24’, ‘The_Sims_3’, ‘tom_ssa’, ‘woxy_vintage’, ‘urbanmusic2000’, ‘dopeguhxfresh’, ‘erections’, ‘DudeBroChill’, ‘lookingformoney’, ‘drnschneider’, ‘MosesMaimonides’, ’92Blues’, ‘elarmelar’, ‘rock937fm’, ‘sonicfm’, ‘erikadotnet’, ‘sky0311’, ‘weqx’, ‘brandamc’, ‘Hot106’, ‘woxy_live’, ‘ksopthecowboy’, ‘vixalius’, ‘cogourl’, ‘Cashintoday’, ‘Andrewdaflirt’, ‘oodle’, ‘mkephart25’, ‘doomed’, ‘spotifyuri’, ‘mangelat’, ‘Cody_K’, ‘swayswaystacey’, ‘KLLY953’, ‘onlaa’, ‘Ginger_Swan’, ‘Call_Embry’, ‘conservatweet’, ‘weerinlelystad’, ‘ruhanirabin’, ‘tmgadops’, ‘wakemeupinside1’, ‘horaoficial’, ‘xstex’, ‘franzidee’, ‘tommytrc’, ‘khopmusic’, ‘tez19’, ‘GaryGotnought’, ‘UnemployKiller’, ‘felloff’, ‘Kalediscope’, ‘TheRealSherina’, ‘jasonsfreestuff’, ‘johnkennick’, ‘sel_gomezx3’, ‘OE3’, ‘AddisonMontg’, ‘_rosieCAKES’, ‘neownblog’, ‘PrinceP23’, ‘ontd_fluffy’, ‘USofAl’, ‘Kacizzle88’, ‘somalush’, ‘FrankieNichelle’, ‘jiva_music’, ‘itz_cookie’, ‘soundOfTheTone’, ‘knowheremom’, ‘Jayme1988’, ‘TrafficPilot’, ‘tweetalot’, 
‘TheStation1610’, ‘lasvegasdivorce’, ‘1000_LINKS_NOW2’, ‘KeepOnTweeting’, ‘uFreelance’, ‘ChocoKouture’, ‘Magic983’, ‘SnarkySharky’, ‘agthekid’, ‘cashinnow’, ‘jamokie’, ‘jessicastanely’, ‘Q103Albany’, ‘GPGTwit’, ‘xAmberNicholex’, ‘wjtlplaylist’, ‘sjAimee’, ‘chrisduhhh’, ‘failbus’, ‘1stwave’, ‘RichardBejah’, ‘nyanko_love’]

### Web Queries Overlap

How much overlap is there between tweets and trending web search queries?

I took the top trending queries during the days of my Twitter crawl from Google Trends, then expanded each trending query until its length was 6 tokens so as to match the average tweet length. Then, I simply counted how many tweets match at least 2 (cleaned) tokens of any of these query-expanded trends:

0.0185654981775 or 2%
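The counting step might look something like this sketch, assuming tweets and expanded trends are already cleaned and tokenized. The function names and the toy tokens are made up for illustration, and the query expansion itself is not shown.

```python
def matches_any_trend(tweet_tokens, trend_sets, min_overlap=2):
    """True if the tweet shares at least min_overlap tokens with any trend."""
    tweet_set = set(tweet_tokens)
    return any(len(tweet_set & trend) >= min_overlap for trend in trend_sets)

def overlap_fraction(tweets, expanded_trends, min_overlap=2):
    """Fraction of tweets matching at least one expanded trend."""
    trend_sets = [set(t) for t in expanded_trends]
    hits = sum(1 for t in tweets
               if matches_any_trend(t, trend_sets, min_overlap))
    return hits / len(tweets)

# Toy, made-up tokens: one 6-token expanded trend and three cleaned tweets.
trends = [["solar", "eclipse", "asia", "longest", "century", "photo"]]
tweets = [["watch", "solar", "eclipse"], ["eclipse", "book"], ["lunch", "time"]]
print(overlap_fraction(tweets, trends))
```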

That’s it for now. I have some more stats but need a bit more time to clean those up before publishing here.

Notes

Can’t distribute my data set unfortunately, but it shouldn’t take too long to assemble a comparable set via Twitter’s spritzer feed. That’ll probably be more useful anyway, as it’ll be more up-to-date than the one I analyzed here. Feel free to pull my stats if you find them useful (top hashtags and users are in JSON format).

# Build an Automatic Tagger in 200 lines with BOSS

My colleagues and I will be giving a talk on BOSS at Yahoo!’s Hack Day in NYC on October 9. To show developers the versatility of an open search API, I developed a simple toy example (see my past ones: TweetNews, Q&A) on the flight over that uses BOSS to generate data for training a machine learned text classifier. The resulting application basically takes two tags, some text, and tells you which tag best classifies that text. For example, you can ask the system if some piece of text is more liberal or conservative.

How does it work? BOSS offers delicious metadata for many search results that have been saved in delicious. This includes top tags, their frequencies, and the number of user saves. Additionally, BOSS makes available an option to retrieve extended search result abstracts. So, to generate a training set, I first build up a query list (100 popular delicious tags), search each query through BOSS (asking for 500 results per query), and filter the results to just those that have delicious tags.

Basically, the collection logically looks like this:

[(result_1, delicious_tags), (result_2, delicious_tags) …]

Then, I invert the collection on the tags while retaining each result’s extended abstract and title fields (concatenated together)

This logically looks like this now:

[(tag_1, result_1.abstract + result_1.title), (tag_2, result_1.abstract + result_1.title), …, (tag_1, result_2.abstract + result_2.title), (tag_2, result_2.abstract + result_2.title) …]
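As a sketch of the inversion step: the dict layout for a result is an assumption (the field names `abstract` and `title` follow the notation above), and the function name is hypothetical.

```python
def invert_on_tags(results):
    """Turn [(result, tags), ...] into [(tag, abstract + title), ...]."""
    pairs = []
    for result, tags in results:
        text = result["abstract"] + " " + result["title"]
        # Each result contributes one (tag, text) pair per delicious tag.
        for tag in tags:
            pairs.append((tag, text))
    return pairs

toy = [({"abstract": "a1", "title": "t1"}, ["liberal", "politics"]),
       ({"abstract": "a2", "title": "t2"}, ["conservative"])]
print(invert_on_tags(toy))
```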

To build a model comparing 2 tags, the system selects pairs from the above collection that have matching tags, converts the abstract + title text into features, and then passes the resulting pairs over to LibSVM to train a binary classification model.
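The feature conversion is not spelled out here, so the following is only a bag-of-words sketch under that assumption. It writes lines in LibSVM's sparse text format (`label index:value ...`, with 1-based indices in ascending order). The helper names are hypothetical.

```python
from collections import Counter

def build_vocab(texts):
    """Assign each distinct token a feature index, starting at 1."""
    vocab = {}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def to_libsvm_line(label, text, vocab):
    """Format one instance as 'label idx:count ...' with ascending indices."""
    counts = Counter(vocab[t] for t in text.lower().split() if t in vocab)
    feats = " ".join("%d:%d" % (i, c) for i, c in sorted(counts.items()))
    return "%+d %s" % (label, feats)

vocab = build_vocab(["tax cuts tax", "health care"])
print(to_libsvm_line(+1, "tax cuts tax", vocab))
print(to_libsvm_line(-1, "care health unknown", vocab))
```

Tokens missing from the vocabulary are simply dropped, which is one simple way to handle unseen words at prediction time.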

Here’s how it works:

tagger viksi$ python gen_training_test_set.py liberal conservative

tagger viksi$ python autosvm.py training_data.txt test_data.txt

__Searching / Training Best Model

____Trained A Better Model: 60.5263

____Trained A Better Model: 68.4211

__Predicting Test Data

__Evaluation

____Right: 16

____Wrong: 4

____Total: 20

____Accuracy: 0.800000

gen_training_test_set finds the pairs with matching tags and splits them into a training set (80% of the pairs) and a test set (20%), saving the data as training_data.txt and test_data.txt respectively. autosvm learns the best model (brute forcing the parameters for you, which could be handy by itself as a general learning tool) and then applies it to the test set, reporting how well it did. In the above case, the system achieved 80% accuracy over 20 test instances.
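An 80/20 split of this kind can be sketched as follows. Whether the actual script shuffles before splitting is not stated, so the seeded shuffle and the function name are assumptions.

```python
import random

def split_train_test(pairs, train_fraction=0.8, seed=0):
    """Shuffle the pairs and split them into (train, test) lists."""
    shuffled = list(pairs)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = split_train_test(range(100))
print(len(train), len(test))
```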

Here’s another way to use it:

tagger viksi$ python classify.py apple microsoft bill gates steve ballmer windows vista xp

microsoft

tagger viksi$ python classify.py apple microsoft steve jobs ipod iphone macbook

apple

classify combines the above steps into an application that, given two tags and some text, will return which tag more likely describes the text. Or, in command line form, ‘python classify.py [tag1] [tag2] [some free text]’ => ‘tag1’ or ‘tag2’

My main goal here is not to build a perfect experiment or classifier (see caveats below), but to show a proof of concept of how BOSS or open search can be leveraged to build intelligent applications. BOSS isn’t just a search API, but really a general data API for powering any application that needs to party on a lot of the world’s knowledge.

I’ve open sourced the code here:

http://github.com/zooie/tagger

Caveats

Although the code totals only ~200 lines, the system is fairly state-of-the-art in that it employs LibSVM for its learning model. However, this classifier setup has several caveats due to my time constraints and goals, as my main intention for this example was to show the awesomeness of the BOSS data. For example, training and testing on abstracts and titles means the top features will probably include the query terms, so the test set may be fairly easy to score well on and may not be representative of real input data. I did later add code to remove query-related features from the test set, and the accuracy seemed to dip just slightly. For classify.py, the ‘some free text’ input needs to be fairly large (about an extended abstract’s size) to be more accurate. Another caveat is what happens when both tags have been used to label a particular search result. The current system may only choose one tag, which may incur an error depending on what’s selected in the test set. Furthermore, the features I’m using are super simple and can be greatly improved with TF-IDF scaling, normalization, feature selection (mutual information gain), etc. Also, the setup should be tested with more training / test instances (checking the distribution of the labels), baselines, and additional evaluation measures.

I could have made this code a lot cleaner and shorter if I had just used LibSVM’s Python interface, but for some reason I forgot about that and wrote scripts that parse the stdout messages of the binaries to get something working fast (but dirty).