Category Archives: Trends

Some Stats about Twitter’s Content

Near the end of July, I crawled a sample of ~10M tweets. On my way over from Open Hack Day NYC yesterday I finally got some time to do some preliminary analysis of this data. Several posts have analyzed Twitter’s traffic stats [TechCrunch] [Mashable] [zooie], so I thought I’d focus more on the content here.

Duplication

By compressing the data and comparing the before and after sizes, one can get a pretty decent understanding of the duplication factor. To do this, I extracted just the raw text messages, sorted them, and then ran gzip over the sorted set.

Compression ratio

>>> 284023259 / 739273532 bytes

0.38419238171778614

Typically, for text compression, gzip-like programs can achieve around 50% without the sort (and sorting typically helps), and here we get 38%. A standard text corpus consists of much larger document sizes, so it’s interesting to see a similar or larger duplication factor for tweets.

We can dive even deeper into this area by analyzing the term overlap statistics to measure near duplication, or messages that aren’t necessarily identical but are close enough.

To do this, I first cleaned the text (removed stopwords, stemmed terms, normalized case). Interesting, after cleaning the text, the average number of tokens for a message is just 6.28, or 2.5x the size of a standard web search query.

Then, I employed consistent term sampling to select N representatives for each cleaned message and coalesced the representatives together as a single key. By comparing the total number of unique keys to messages, one can infer the near duplication factor. Also, the higher the N, the higher the threshold is to match (so N >= 6, 6 being the average number of tokens per message, probably means that two messages that generate the same key are exact duplicates).

You’ll notice N >=6 converges around 84%, implying that after cleaning the text, 16% of the messages exactly match some other message. Additionally, when N = 2 (or requiring 2 / 6 tokens or 33% of the text on average) to match, 45% of the messages collide with other messages in the corpus. At N = 2, matching often means the messages discuss the same general topic, but aren’t close near duplicates.

N Term Samples Unique Keys Coverage
8 8548695 0.8356
6 8512672 0.8321
5 8476590 0.8286
4 8366391 0.8177
3 8098400 0.7916
2 5716566 0.5588
1 1013783 0.0991

 

 

 

 

 

 

 

URLs

URLs are present in ~18% of the tweets

Of those, ~65% of the URLs are unique

70K Unique Domains covering 2M URLS

Top Domains:

[‘bit.ly’, ‘tinyurl.com’, ‘twitpic.com’, ‘is.gd’, ‘myloc.me’, ‘ow.ly’, ‘ustre.am’, ‘cli.gs’, ‘tr.im’, ‘plurk.com’, ‘ff.im’, ‘tumblr.com’, ‘yfrog.com’, ‘140mafia.com’, ‘u.mavrev.com’, ‘twurl.nl’, ‘tweeterfollow.com’, ‘mypict.me’, ‘viagracan.com’, ‘vipfollowers.com’, ‘morefollowers.net’, ‘digg.com’, ‘tweeteradder.com’, ‘ping.fm’, ‘tiny.cc’, ‘followersnow.com’, ‘short.to’, ‘twit.ac’, ‘snipr.com’, ‘wefollow.com’, ‘tweet.sg’, ‘url4.eu’, ‘the-twitter-follow-train.info’, ‘fwix.com’, ‘budurl.com’, ‘su.pr’, ‘shar.es’, ‘tinychat.com’, ‘snipurl.com’, ‘loopt.us’, ‘migre.me’, ‘flic.kr’, ‘myspace.com’, ‘snurl.com’, ‘twitgoo.com’, ‘zshare.net’, ‘post.ly’, ‘bkite.com’, ‘yes.com’, ‘flickr.com’, ‘twitter.com’, ‘artistsforschapelle.com’, ‘140army.com’, ‘youtube.com’, ‘x.imeem.com’, ‘pic.gd’, ‘TwitterBackgrounds.com’, ‘raptr.com’, ‘twt.gs’, ‘twitthis.com’, ‘mobypicture.com’, ‘tobtr.com’, ‘ad.vu’, ‘sml.vg’, ‘rubyurl.com’, ‘tinylink.com’, ‘redirx.com’, ‘a2a.me’, ‘eCa.sh’, ‘vimeo.com’, ‘meadd.com’, ‘hotjobs.yahoo.com’, ‘doiop.com’, ‘myurl.in’, ‘urlpire.com’, ‘buzzup.com’, ‘freead.im’, ‘youradder.com’, ‘facebook.com’, ‘adf.ly’, ‘justin.tv’, ‘twitvid.com’, ‘adjix.com’, ‘twcauses.com’, ‘lkbk.nu’, ‘tlre.us’, ‘htxt.it’, ‘stickam.com’, ‘twubs.com’, ‘isy.gs’, ‘reverbnation.com’, ‘news.bbc.co.uk’, ‘sn.im’, ‘twibes.com’, ‘ustream.tv’, ‘trim.su’, ‘hashjobs.com’, ‘blogtv.com’, ‘jobs-cb.de’, ‘xsaimex.com’]

Retweets

~4% of messages are retweets

Replied @Users

~1M total replied-to users in this data set

37% of tweets contain ‘@x’ terms

Most Popular Replied-to Users (almost all celebrities):

[‘@mileycyrus’, ‘@jonasbrothers’, ‘@ddlovato’, ‘@mitchelmusso’, ‘@donniewahlberg’, ‘@souljaboytellem’, ‘@tommcfly’, ‘@addthis’, ‘@officialtila’, ‘@johncmayer’, ‘@shanedawson’, ‘@bowwow614’, ‘@jordanknight’, ‘@ryanseacrest’, ‘@perezhilton’, ‘@jonathanrknight’, ‘@petewentz’, ‘@tweetmeme’, ‘@adamlambert’, ‘@david_henrie’, ‘@dealsplus’, ‘@dwighthoward’, ‘@iamdiddy’, ‘@lancearmstrong’, ‘@songzyuuup’, ‘@imeem’, ‘@blakeshelton’, ‘@dannymcfly’, ‘@lilduval’, ‘@selenagomez’, ‘@markhoppus’, ‘@yelyahwilliams’, ‘@therealpickler’, ‘@stephenfry’, ‘@mrtweet.’, ‘@taylorswift13’, ‘@michaelsarver1’, ‘@davidarchie’, ‘@the_real_shaq’, ‘@tyrese4real’, ‘@britneyspears’, ‘@106andpark’, ‘@ashleytisdale’, ‘@mariahcarey’, ‘@kimkardashian’, ‘@wale’, ‘@mashable’, ‘@programapanico’, ‘@therealjordin’, ‘@listensto’, ‘@misskeribaby’, ‘@alyssa_milano’, ‘@alexalltimelow’, ‘@aplusk’, ‘@thisisdavina’, ‘@breakingnews:’, ‘@peterfacinelli’, ‘@truebloodhbo’, ‘@mgiraudofficial’, ‘@tonyspallelli’, ‘@mtv’, ‘@jackalltimelow’, ‘@dfizzy’, ‘@youngq’, ‘@tomfelton’, ‘@pooch_dog’, ‘@jonaskevin’, ‘@princesammie’, ‘@nkotb’, ‘@christianpior’, ‘@cthagod’, ‘@johnlloydtaylor’, ‘@neilhimself’, ‘@moontweet’, ‘@katyperry’, ‘@danilogentili’, ‘@mchammer’, ‘@rainnwilson’, ‘@joeymcintyre’, ‘@30secondstomars’, ‘@phillyd’, ‘@heidimontag’, ‘@mrpeterandre’, ‘@andyclemmensen’, ‘@crystalchappell’, ‘@kevindurant35’, ‘@huckluciano’, ‘@dannygokey’, ‘@jaketaustin’, ‘@revrunwisdom’, ‘@jamesmoran’, ‘@musewire’, ‘@dannywood’, ‘@nickiminaj’, ‘@akgovsarahpalin’, ‘@terrencej106’, ‘@mashable:’, ‘@drewryanscott’, ‘@mrtweet’, ‘@necolebitchie’, ‘@lilduval:’, ‘@willie_day26’, ‘@kirstiealley’, ‘@betthegame’, ‘@radiomsn’, ‘@alancarr’, ‘@rafinhabastos’, ‘@krisallen4real’, ‘@iamjericho’, ‘@breakingnews’, ‘@babygirlparis’, ‘@ladygaga’, ‘@chris_daughtry’, ‘@hypem’, ‘@danecook’, ‘@imcudi’, ‘@jeepersmedia’, ‘@buckhollywood’, ‘@kimmyt22’, ‘@giulianarancic’, ‘@chrisbrogan’, ‘@nasa’, ‘@addtoany’, ‘@nickcarter’, ‘@debbiefletcher’, ‘@marcoluque’, ‘@shaundiviney’, ‘@ogochocinco’, ‘@twitter’, ‘@eddieizzard’, ‘@youngbillymays’, ‘@real_ron_artest’, ‘@pink’, ‘@laurenconrad’, ‘@rubarrichello’, ‘@ianjamespoulter’, ‘@liltwist’, ‘@teyanataylor’, ‘@dougiemcfly’, ‘@theellenshow’, ‘@robkardashian’, ‘@sherrieshepherd’, ‘@justinbieber’, ‘@paulaabdul’, ‘@jason_manford’, ‘@jaredleto’, ‘@tracecyrus’, ‘@itsonalexa’, ‘@ddlovato:’, ‘@khloekardashian’, ‘@revrunwisdom:’, ‘@solangeknowles’, ‘@allison4realzzz’, ‘@nickjonas’, ‘@reply’, ‘@anarbor’, ‘@donlemoncnn’, ‘@gfalcone601’, ‘@moonfrye’, ‘@symphnysldr’, ‘@iamspectacular’, ‘@honorsociety’, ‘@questlove’, ‘@guykawasaki’, ‘@dawnrichard’, ‘@_maxwell_’, ‘@somaya_reece’, ‘@mandyyjirouxx’, ‘@teemwilliams’, ‘@greggarbo’, ‘@pennjillette’, ‘@mikeyway’, ‘@matthardybrand’, ‘@iamjonwalker’, ‘@andyroddick’, ‘@kohnt01’, ‘@chris_gorham’, ‘@seankingston’, ‘@joshgroban’, ‘@mousebudden’, ‘@misskatieprice’, ‘@spencerpratt’, ‘@wilw’, ‘@jgshock’, ‘@swear_bot’, ‘@joelmadden’, ‘@techcrunch’, ‘@americanwomannn’, ‘@kelly__rowland’, ‘@mionzera’, ‘@astro_127’, ‘@_@’, ‘@spam’, ‘@sookiebontemps’, ‘@drakkardnoir’, ‘@noh8campaign’, ‘@kayako’, ‘@trvsbrkr’, ‘@qbkilla’, ‘@mw55’, ‘@guykawasaki:’, ‘@donttrythis’, ‘@cv31’, ‘@liljjdagreat’, ‘@tiamowry’, ‘@nickensimontwit’, ‘@holdemtalkradio’, ‘@bradiewebbstack’, ‘@nytimes’, ‘@riskybizness23’, ‘@radityadika’, ‘@adrienne_bailon’, ‘@riccklopes’, ‘@jessicasimpson’, ‘@sportsnation’, ‘@jasonbradbury’, ‘@huffingtonpost’, ‘@oceanup’, ‘@gilbirmingham’, ‘@iconic88’, ‘@the’, ‘@thebrandicyrus’, ‘@gordela’, ‘@thedebbyryan’, ‘@jessemccartney’, ‘@?’, ‘@caiquenogueira’, ‘@celsoportiolli’, ‘@shontelle_layne’, ‘@calvinharris’, ‘@chattyman’, ‘@ali_sweeney’, ‘@anamariecox’, ‘@joshthomas87’, ‘@emilyosment’, ‘@nasa:’, ‘@sevinnyne6126’, ‘@thebiggerlights’, ‘@theboygeorge’, ‘@jbarsodmg’, ‘@goldenorckus’, ‘@warrenwhitlock’, ‘@bobbyedner’, ‘@myfabolouslife’, ‘@descargaoficial’, ‘@ochonflcinco85’, ‘@ninabrown’, ‘@billycurrington’, ‘@oprah’, ‘@junior_lima’, ‘@asherroth’, ‘@starbucks’, ‘@jason_pollock’, ‘@intanalwi’, ‘@harrislacewell’, ‘@serenajwilliams’, ‘@kevinruddpm’, ‘@bigbrotherhoh’, ‘@oliviamunn’, ‘@chamillionaire’, ‘@tamekaraymond’, ‘@teamwinnipeg’, ‘@littlefletcher’, ‘@piercethemind’, ‘@brookandthecity’, ‘@iranbaan:’, ‘@tonyrobbins’, ‘@maestro’, ‘@glennbeck’, ‘@1omarion’, ‘@nadhiyamali’, ‘@slimthugga’, ‘@jason_mraz’, ‘@profbrendi’, ‘@djaaries’, ‘@juanestwiter’, ‘@davegorman’, ‘@zackalltimelow’, ‘@mamajonas’, ‘@itschristablack’, ‘@skydiver’, ‘@gigva’, ‘@currensy_spitta’, ‘@paulwallbaby’, ‘@rpattzproject’, ‘@petewentz:’, ‘@rodrigovesgo’, ‘@drdrew’, ‘@sportsguy33’, ‘@cthagod:’, ‘@hollymadison123’, ‘@mjjnews’, ‘@itsbignicholas’, ‘@_supernatural_’, ‘@santoevandro’, ‘@demar_derozan’, ‘@marthastewart’, ‘@billganz62’, ‘@oodle’, ‘@davidleibrandt’]

Hashtags

~7% of messages contain hashtags

Total Unique Hashtags found: ~94k

Top Hashtags:

[‘#lies’, ‘#fb’, ‘#musicmonday’, ‘#truth’, ‘#iranelection’, ‘#moonfruit’, ‘#tendance’, ‘#jobs’, ‘#ihavetoadmit’, ‘#mariomarathon’, ‘#140mafia’, ‘#tcot’, ‘#zyngapirates’, ‘#followfriday’, ‘#spymaster’, ‘#ff’, ‘#1’, ‘#sotomayor’, ‘#turnon’, ‘#notagoodlook’, ‘#tweetmyjobs’, ‘#hiring:’, ‘#iran’, ‘#fun140’, ‘#jesus’, ‘#72b381.’, ‘#quote’, ‘#tinychat’, ‘#neda’, ‘#militarymon’, ‘#gr88’, ‘#trueblood’, ‘#fail’, ‘#news’, ‘#140army’, ‘#livestrong’, ‘#noh8’, ‘#wpc09’, ‘#music’, ‘#turnoff’, ‘#unacceptable’, ‘#twables’, ‘#masterchef’, ‘#noh84kradison’, ‘#writechat’, ‘#job’, ‘#squarespace’, ‘#michaeljackson’, ‘#2’, ‘#nothingpersonal’, ‘#iphone’, ‘#ala2009’, ‘#mj’, ‘#tdf’, ‘#blogtalkradio’, ‘#mlb’, ‘#1stdraftmovielines’, ‘#p2’, ‘#secretagent’, ‘#tlot’, ‘#72b381’, ‘#honduras’, ‘#twitter’, ‘#jtv’, ‘#tehran’, ‘#gorillapenis’, ‘#porn’, ‘#bb11’, ‘#sotoshow’, ‘#brazillovesatl’, ‘#google’, ‘#oneandother’, ‘#bb10’, ‘#chucknorris’, ‘#cmonbrazil’, ‘#agendasource’, ‘#travel’, ‘#ashes’, ‘#dumbledore’, ‘#freeschapelle’, ‘#tl’, ‘#dealsplus’, ‘#nsfw’, ‘#entourage’, ‘#tech’, ‘#hottest100’, ‘#3693dh…’, ‘#torchwood’, ‘#design’, ‘#teaparty’, ‘#love’, ‘#dontyouhate’, ‘#mileycyrus’, ‘#sgp’, ‘#harrypottersequels’, ‘#peteandinvisiblechildren’, ‘#stopretweets’, ‘#tscc’, ‘#wimbledon’, ‘#hive’, ‘#cubs’, ‘#3’, ‘#redsox’, ‘#photography’, ‘#voss’, ‘#snods’, ‘#lol’, ‘#socialmedia’, ‘#gop’, ‘#health’, ‘#esriuc’, ‘#green’, ‘#follow’, ‘#echo!’, ‘#obama’, ‘#digg’, ‘#shazam’, ‘#hhrs’, ‘#video’, ‘#moonfruit.’, ‘#swineflu’, ‘#politics’, ‘#ebuyer683’, ‘#umad’, ‘#quizdostandup’, ‘#thankyoumichael’, ‘#blogchat’, ‘#wordpress’, ‘#3693dh’, ‘#haiku’, ‘#ttparty’, ‘#lastfm:’, ‘#healthcare’, ‘#hcr’, ‘#ecgc’, ‘#seo’, ‘#apple’, ‘#chuck’, ‘#wine’, ‘#sammie’, ‘#h1n1’, ‘#marketing’, ‘#twitition’, ‘#happybirthdaymitchel18’, ‘#cnn’, ‘#lie’, ‘#rt:’, ‘#art’, ‘#nasa’, ‘#blog’, ‘#quotes’, ‘#bruno’, ‘#business’, ‘#palin’, ‘#mw2’, ‘#hcsm’, ‘#harrypotter’, ‘#4’, ‘#lastfm’, ‘#askclegg’, ‘#photo’, ‘#jobfeedr’, ‘#lgbt’, ‘#lies:’, ‘#ihavetoadmit.i’, ‘#jamlegend,’, ‘#truthbetold’, ‘#mcfly’, ‘#microsoft’, ‘#fashion’, ‘#tweetphoto’, ‘#ebuyer167201’, ‘#noh84adison’, ‘#5’, ‘#mets’, ‘#china’, ‘#bigprize’, ‘#whythehell’, ‘#money’, ‘#sophiasheart’, ‘#finance’, ‘#michael’, ‘#f1’, ‘#adamlambert100k’, ‘#web’, ‘#urwashed’, ‘#moonfruit!’, ‘#1:’, ‘#kayako’, ‘#lies.’, ‘#thankyouaaron’, ‘#food’, ‘#wow’, ‘#moonfruit,’, ‘#facebook’, ‘#ebuyer291’, ‘#ecomonday’, ‘#ihave’, ‘#happybdaydenise’, ‘#postcrossing’, ‘#ichc’, ‘#912’, ‘#demilovatolive’, ‘#gijoemoviefan’, ‘#funny’, ‘#media’, ‘#meowmonday’, ‘#israel’, ‘#blogger’, ‘#forasarney’, ‘#tv’, ‘#topgear’, ‘#chrisisadouche’, ‘#stlcards’, ‘#wec09’, ‘#forex’, ‘#aots1000’, ‘#celebrity’, ‘#dwarffilmtitles’, ‘#6’, ‘#yeg’, ‘#slaughterhouse’, ‘#nfl’, ‘#photog’, ‘#ny’, ‘#firstdraftmovies’, ‘#ufc’, ‘#reddit’, ‘#free’, ‘#iwish’, ‘#etsy’, ‘#rulez’, ‘#sports’, ‘#icmillion’, ‘#mmot’, ‘#webdesign’, ‘#deals’, ‘#moonfruit?’, ‘#pawpawty’, ‘#twitterfahndung’, ‘#billymaystribute’, ‘#sytycd’, ‘#runkeeper’, ‘#scotus’, ‘#yoconfieso’, ‘#mariomarathon,’, ‘#musicmondays’, ‘#lies,’, ‘#findbob’, ‘#realestate’, ‘#sohrab’, ‘#sales’, ‘#metal’, ‘#runescape’, ‘#hypem’, ‘#threadless’, ‘#gay’, ‘#isyouserious’, ‘#hollywood,’, ‘#2:’, ‘#ca,’, ‘#golf’, ‘#diadorock’, ‘#newyork,’, ‘#meteor’, ‘#dailyquestion’, ‘#photoshop’, ‘#saveiantojones’, ‘#musicmonday:’, ‘#rock’, ‘#sex’, ‘#mlbfutures’, ‘#ilove’, ‘#mikemozart’, ‘#nascar’, ‘#indico’, ‘#crossfitgames’, ‘#gratitude’, ‘#quote:’, ‘#creativetechs’, ‘#truth:’, ‘#sharepoint’, ‘#mkt’, ‘#why’, ‘#bigbrother’, ‘#tam7’, ‘#ihate’, ‘#futureruby’, ‘#slickrick’, ‘#105.3’, ‘#youareinatl’, ‘#vegan’, ‘#dontletmefindout’, ‘#imustadmit’, ‘#7’, ‘#twitterafterdark’, ‘#sunnyfacts’, ‘#gilad’, ‘#japan’, ‘#iremember’, ‘#97.3’, ‘#puffdaddy’, ‘#blogher’, ‘#ade2009’, ‘#aaliyah’, ‘#alfredosms’, ‘#95.1’, ‘#truth,’, ‘#twine’, ‘#hiring’]

Questions

Hard to infer exactly whether a message is a question or not, so I ran a couple of different filters:

5W’s, H, ? present ANYWHERE in tweet:

0.102789281948 or 10%

5W’s, H first token or ? last token:

0.0238229662219 or 2%

Just ? ANYWHERE in tweet:

0.0040984928533 or 0.4%

Users

Discovered ~2M unique users

Top Sending Users (many bots):

[‘followermonitor’, ‘Tweet_Words’, ‘currentcet’, ‘currentutc’, ‘whattimeisitnow’, ‘ItIsNow’, ‘ThinkingStiff’, ‘otvrecorder’, ‘delicious50’, ‘Porngus’, ‘craigslistjobs’, ‘GorPen’, ‘hashjobs’, ‘TransAlchemy2’, ‘bot_theta’, ‘CHRISVOSS’, ‘bot_iota’, ‘bot_kappa’, ‘TIPAS’, ‘VeolaJBanner’, ‘StacyDWatson’, ‘LMAObot’, ‘SarahJSlonecker’, ‘AllisonMRussell’, ‘bot_eta’, ‘SandraHOakley’, ‘bot_psi’, ‘bot_tau’, ‘LoreleiRMercer’, ‘bot_zeta’, ‘bot_gamma’, ‘bot_sigma’, ‘bot_lambda’, ‘bot_pi’, ‘bot_epsilon’, ‘bot_nu’, ‘bot_rho’, ‘bot_omicron’, ‘bot_khi’, ‘LindaTYoung’, ‘mensrightsindia’, ‘bot_omega’, ‘bot_ksi’, ‘bot_delta’, ‘bot_alpha’, ‘bot_phi’, ‘CindaDJenkins’, ‘bot_mu’, ‘ImogeneDPetit’, ‘bot_upsilon’, ‘OPENLIST_CA’, ‘openlist’, ‘isygs’, ‘dq_jumon’, ‘gamingscoop’, ‘MildredSLogan’, ‘ObiWanKenobi_’, ‘pulseSearch’, ‘MaryEVo’, ‘ImeldaGMcward’, ‘MaryJNewman’, ‘SharonTForde’, ‘LoriJCornelius’, ‘BrandyWPulliam’, ‘RhondaTLopez’, ‘AprilKOropeza’, ‘CarolETrotman’, ‘SusanATouvell’, ‘dinoperna’, ‘buzzurls’, ‘_Freelance_’, ‘DrSnooty’, ‘illstreet’, ‘bibliotaph_eyes’, ‘loc4lhost’, ‘bsiyo’, ‘BOTHOUSE’, ‘post_ads’, ‘qazkm’, ‘frugaldonkey’, ‘free_post’, ‘groovera’, ‘wonkawonkawonka’, ‘ForksGirlBella’, ‘casinopokera’, ‘dermdirectoryny’, ‘Yoowalk_chat’, ‘mstehr’, ‘hashgoogle’, ‘perry1949’, ‘ensiz_news’, ‘Bezplatno_net’, ‘timesmirror’, ‘work_freelance’, ‘cockbot’, ‘pdurham’, ‘bombtter_raw’, ‘ocha1’, ‘AlairAneko24’, ‘HaiIAmDelicious’, ‘Freshestjobs’, ‘fast_followers’, ‘LeadsForFree’, ‘RideOfYourLife’, ‘AlastairBotan30’, ‘helpmefast25’, ‘TheMLMWizard’, ‘uitrukken’, ‘adoptedALICE’, ‘TKATI’, ‘ezadsncash’, ‘tweetshelp’, ‘LAmetro_traffic’, ‘thinkpozzitive’, ‘StarrNeishaa’, ‘AldenCho36’, ‘JobHits’, ‘wootboot’, ‘smacula’, ‘faithclubdotnet’, ‘DmitriyVoronov’, ‘brownthumbgirl’, ‘NYCjobfeed’, ‘hfradiospacewx’, ‘FakeeKristenn’, ‘MLBDAILYTIMES’, ‘wildingp’, ‘JacksonsReview’, ‘EarthTimesPR’, ‘friedretweet’, ‘Wealthy23’, ‘RokpoolFM’, ‘HDOLLAZ’, ‘_MrSpacely’, ‘Bestdocnyc’, ‘Rabidgun’, ‘flygatwick’, ‘live_china’, ‘friendlinks’, ‘retweetinator’, ‘iamamro’, ‘thayferreira’, ‘AldisDai39’, ‘AndersHana60’, ‘nonstopNEWS’, ‘VivaLaCash’, ‘TravelNewsFeeds’, ‘vuelosplus’, ‘threeporcupines’, ‘DemiAuzziefan’, ‘worldofprint’, ‘KevinEdwardsJr’, ‘REDDITSPAMMOR’, ‘NatValentine’, ‘ChanelLebrun’, ‘nowbot’, ‘hollyswansonUK’, ‘youngrhome’, ‘M_Abricot’, ‘thefakemandyv’, ‘scrapbookingpas’, ‘Naughtytimes’, ‘Opcode1300_bot’, ‘tellsecret’, ‘tboogie937’, ‘Climber_IT’, ‘comlist’, ‘with_a_smile’, ‘USN_retired’, ‘Climber_EngJobs’, ‘Climber_Finance’, ‘Climber_HRJobs’, ‘intanalwi’, ‘Climber_Sales’, ‘nadhiyamali’, ‘wonderfulquotes’, ‘MRAustria’, ‘O2Q’, ‘GL0’, ‘SookieBonTemps’, ‘MRSchweiz’, ‘latinasabor’, ‘nineleal’, ‘casservice’, ‘AltonGin54’, ‘KulerFeed’, ‘_cesaum’, ‘HFMONAIR’, ‘DeeOnDreeYah’, ‘rockstalgica’, ‘iamword’, ‘rpattzproject’, ‘madblackcatcom’, ‘ftfradio’, ‘marciomtc’, ‘SocialNetCircus’, ‘AnotherYearOver’, ‘ichig’, ‘tcikcik’, ‘HelenaMarie210’, ‘mrbax0’, ‘SWBot’, ‘DayTrends’, ‘_Embry_Call_’, ‘eProducts24’, ‘The_Sims_3’, ‘tom_ssa’, ‘woxy_vintage’, ‘urbanmusic2000’, ‘dopeguhxfresh’, ‘erections’, ‘DudeBroChill’, ‘lookingformoney’, ‘drnschneider’, ‘MosesMaimonides’, ’92Blues’, ‘elarmelar’, ‘rock937fm’, ‘sonicfm’, ‘erikadotnet’, ‘sky0311’, ‘weqx’, ‘brandamc’, ‘Hot106’, ‘woxy_live’, ‘ksopthecowboy’, ‘vixalius’, ‘cogourl’, ‘Cashintoday’, ‘Andrewdaflirt’, ‘oodle’, ‘mkephart25’, ‘doomed’, ‘spotifyuri’, ‘mangelat’, ‘Cody_K’, ‘swayswaystacey’, ‘KLLY953’, ‘onlaa’, ‘Ginger_Swan’, ‘Call_Embry’, ‘conservatweet’, ‘weerinlelystad’, ‘ruhanirabin’, ‘tmgadops’, ‘wakemeupinside1’, ‘horaoficial’, ‘xstex’, ‘franzidee’, ‘tommytrc’, ‘khopmusic’, ‘tez19’, ‘GaryGotnought’, ‘UnemployKiller’, ‘felloff’, ‘Kalediscope’, ‘TheRealSherina’, ‘jasonsfreestuff’, ‘johnkennick’, ‘sel_gomezx3’, ‘OE3’, ‘AddisonMontg’, ‘_rosieCAKES’, ‘neownblog’, ‘PrinceP23’, ‘ontd_fluffy’, ‘USofAl’, ‘Kacizzle88’, ‘somalush’, ‘FrankieNichelle’, ‘jiva_music’, ‘itz_cookie’, ‘soundOfTheTone’, ‘knowheremom’, ‘Jayme1988’, ‘TrafficPilot’, ‘tweetalot’, ‘TheStation1610’, ‘lasvegasdivorce’, ‘1000_LINKS_NOW2’, ‘KeepOnTweeting’, ‘uFreelance’, ‘ChocoKouture’, ‘Magic983’, ‘SnarkySharky’, ‘agthekid’, ‘cashinnow’, ‘jamokie’, ‘jessicastanely’, ‘Q103Albany’, ‘GPGTwit’, ‘xAmberNicholex’, ‘wjtlplaylist’, ‘sjAimee’, ‘chrisduhhh’, ‘failbus’, ‘1stwave’, ‘RichardBejah’, ‘nyanko_love’]

Web Queries Overlap

How much overlap is there between tweets and trending web search queries?

I took the top trending queries during the days of my twitter crawl from Google Trends, then query expanded each trending query until the length was 6 tokens so as to equalize the average lengths. Then, I simply counted how many tweets match at least 2 (cleaned) tokens of any of these query-expanded trends:

0.0185654981775 or 2%

That’s it for now. I have some more stats but need a bit more time to clean those up before publishing here.

Notes

Can’t distribute my data set unfortunately, but it shouldn’t take too long to assemble a comparable set via Twitter’s spritzer feed – that’ll probably be more useful as it’ll be more update-to-date than the one I analyzed here. Feel free to pull my stats off if you find them useful (top hashtags and users are in JSON format).

Advertisements

10 Comments

Filed under Data Mining, Research, Search, Social, Statistics, Trends, Twitter

Is the Facebook Application Platform Fair?

Take a look at this stats deck from O’Reilly’s Graphing Social Patterns conference:

http://en.oreilly.com/gspeast2008/public/asset/attachment/2950

Fairly in-depth and recent [6/01/2008] analysis of the application usage in Facebook and MySpace.

As expected, lots of power law behavior.

I found the slides describing churn to be pretty interesting. Since October 2007, nine of the top fifteen most popular applications are new. However, only three of those new applications debuted after March 2008. I expect the amount of churn in the top spots to continue to drop based on the recent declining active usage trends and Facebook’s efforts to curb application spam (new UI that puts applications in a separate profile tab, app module minimizing, viral friend messaging limits, security compliance, etc).

What I would find even more interesting is a study of the number of applications users install, and how those moving averages have changed over time. Like say the number of applications a typical user installs is 4. Once the user reaches that threshold, what’s the churn like then? Specifically, what are the chances that a user will add a new app? Or maybe an even better metric: how long does it take, and how does this length of time compare to when the user had 1 app and increased to 2, or 2 apps and increased to 3, etc.? Basically, what’s the adoption rate/times based on current application counts?

I believe it becomes harder to influence a user to add or replace for a new app if the number of current apps the user has is high. I think most users, without even knowing it, have a threshold of how many total apps they are willing display on their profile – and that this threshold is based on an ongoing evaluation of the utility and efficiency of the page. Each app takes up real estate on the profile page, and a “rational” user will only show so many until page load times degrade and/or core modules (wall, general information, albums, networks) get drowned in clutter and thus become difficult for users to locate. Of course, social networks like MySpace which have very minimal profile page design constraints prove that most users are irrational 😉 – but it’s this design control that greatly helped Facebook dominate the market IMO.

If this is true, then it means that first movers really, really win in the Facebook apps world. Companies like Slide and Rockyou manage many of the top applications, and given the power law market share phenomena, they control a majority stake of application usage and installs. Many of these companies had the early bird advantage, and once winners, always winners – acquisitions of emerging applications, leveraging branding and existing audiences (a.k.a monopoly) to cross promote potentially copy-cat applications faster and wider than the competition, etc. Monopolies inside Facebook have unsettling ramifications, as they block newcomers from capturing profile space. If they fail to innovate (as most monopolies) then next-gen application development may never get through.

Now, if users do have an application count threshold, and it becomes successively more difficult to replace/add a new app as this count increases, then any apps developed now have a substantially rarer chance of gaining market share. If winner’s win, first movers reap, and churn becomes improbable over time, then the early top apps have already most likely filled up users’ allocated app slots.

I find thinking of the profile page as a resource allocation problem rather fascinating. Essentially, there are finite resources on a page and we expect rational users to perform some optimization to allocate resources to maximize utility for themselves and for others (potential game theory link). Once users fill up these resources, human laziness kicks in. Another warrant for improbable churn is that users who want to add new applications after filling up their resource limit will need to remove an existing app to make space. The standards for change are higher now, as the user must compare the new app to an existing preferred app (which probably is a popular early-bird app that friends use), and so the decision will incur a trade-off.

One could also argue that with more apps available now (second slide shows that despite sluggish usage the # of app’s being developed is still growing insanely) users are burdened with more choices. Or, one could argue because most users have reached their app limit, and thus, churn has become improbable, the discoverability of new apps among friends (a critical channel for adoption) also becomes improbable.

Under this theory, especially in context of Facebook’s current efforts and app stats, the growth of new app adoption in social networks will continue to slow down.

So what can be done here?

The platform needs to encourage more churn by building a fairer market that matches users to high quality apps that satisfy their expressed intents. At the end of the day, these applications are really just web pages, but unlike the web, they do not leverage important primitives like linking and meta tags. Search engines like Google and Yahoo use these features extensively to calculate authority and relevance. In the long run, as the number of sources increases, advanced ranking algorithms and marketplaces are necessary to scale and ensure fairness to worthy tail publishers. Maybe social networks should inherit these system properties to bolster their tail applications.

Also, Facebook needs to encourage users to variate or add more applications to their profile page. Facebook’s move to put applications in its own profile tab may very well achieve this goal, but at a consequence of lowering their visibility.

Anyways, just some random thoughts about the current state of Facebook apps. It’ll be very interesting to see how their platform progresses and how it will be perceived by end users and developers in the future.

3 Comments

Filed under Economics, Facebook, Non-Technical-Read, Social, Statistics, Trends

Techmeme Leaderboard 2007 – More!

I’m an avid reader of Techmeme. Love the idea, UI, freshness, coverage, and most of all the quality of the articles.

When the Techmeme Leaderboard debuted earlier this month, lots of buzz circulated the blogosphere. Me, being a huge fan of partying on data, loved the concept, and wanted to take the analysis even further (Yuvi style, but with a search twist).

So yesterday I wrote up some code to crawl and analyze Techmeme articles over the whole year (Leaderboard shows the Top 50 sources for this month). I took a snapshot of Techmeme at 1:00PM every day between beginning January – end of September of 2007.

I computed basic statistics, like number of stories by author and source, as well as more involved measurements like the top word mentions of the year – in total and by category (used simple NLP to clean up the text and remove stopwords).

So, without further ado, here are the results:

Number of Stories by Author in 2007, Ranked
Number of Stories by Source in 2007, Ranked
Most Mentioned Words in 2007, Ranked
* words are stemmed
Most Mentioned Words, by Category, Trends in 2007, Ranked

Hope you guys find these results super interesting and useful.

1 Comment

Filed under Blog Stuff, Data Mining, Information Retrieval, NLP, Statistics, Techmeme, Trends