
An Evaluation of Google’s Realtime Search

How timely are the results returned from Google’s Realtime (RT) Search Engine? How often do Twitter results appear in these results? Over the weekend I developed a few basic experiments to find out and published the results below.

Key Findings

  • For location-based queries, there’s nearly a coin-flip chance (43%) that a Twitter result will be the #1 ranked result.
  • For general knowledge queries, there’s a 23% chance that a Twitter result will be #1.
  • The newest Twitter results are usually 2 to 4 seconds old. The newest Web results are 10x to 20x older (41 seconds).
  • A top-ranked Twitter result for a location-based query is about 2 minutes old on average (compared with 22 minutes for a top-ranked Web result, again nearly 10x older).
  • When Twitter results appear, at least one of them is in the top-ranked position.
Experiment #1 – General Knowledge

I crawled 1,370 article titles from Wikipedia and ran each title as a query into Google RT search.

Market Shares

81% of all queries returned search results that included web page results
23% of all queries returned search results that included Twitter results
7% of all queries returned 0 search results

70% of all queries had a web page result in the #1 ranked position
When Twitter results appeared there was always at least one result in the #1 ranked position (so 23% of queries)

Time Lag

When a web page was the #1 ranked result, that result on average was 6736 seconds (or 1 hr and 52 minutes) old.
When a Tweet was the #1 ranked result, that result on average was 261 seconds (or 4 minutes and 21 seconds) old.

The average age of the top 10% newest web page results (across all queries) is 41 seconds
The average age of the top 10% newest Twitter results (across all queries) is 2 seconds
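
The “top 10% newest” figures can be reproduced along the following lines. This is only a sketch: it assumes the age of each result (in seconds) has already been collected into a list, and the function names `mean_age_of_newest` and `format_age` are hypothetical helpers, not part of the original experiment code.

```python
def mean_age_of_newest(ages_seconds, percent=10):
    """Average age of the newest `percent` of results (the smallest ages)."""
    if not ages_seconds:
        raise ValueError("need at least one result age")
    ordered = sorted(ages_seconds)
    # Integer arithmetic avoids float rounding; always keep at least one result.
    k = max(1, (len(ordered) * percent) // 100)
    newest = ordered[:k]
    return sum(newest) / len(newest)


def format_age(seconds):
    """Render a number of seconds as 'H hr M min S sec'."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours} hr {minutes} min {secs} sec"
```

For example, `format_age(6736)` gives `1 hr 52 min 16 sec`, which matches the web page figure above.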

Tail

Query length ranged from 1 to 12 words (1-2 word queries were the most common)
Worth noting that no Twitter results appeared for queries longer than 5 words

Experiment #2 – Location

I crawled 265 major populated U.S. cities from the U.S. Census Bureau and ran each city name as a query into Google RT search.

Market Shares

73% of all queries returned search results that included web page results
43% of all queries returned search results that included Twitter results
5% of all queries returned 0 search results

52% of all queries had a web page result in the #1 ranked position
When Twitter results appeared there was always at least one result in the #1 ranked position (so 43% of queries)

Time Lag

When a web page was the #1 ranked result, that result on average was 1341 seconds (or 22 minutes and 21 seconds) old.
When a Tweet was the #1 ranked result, that result on average was 138 seconds (or 2 minutes and 18 seconds) old.

The average age of the top 10% newest web page results (across all queries) is 41 seconds
The average age of the top 10% newest Twitter results (across all queries) is 4 seconds

Tail

Query length ranged from 1 to 3 words
Worth noting that no Twitter results appeared for 3-word queries

Implementation Details

  • Generated Wiki queries by running “site:en.wikipedia.org” searches on Google and Blekko, and extracting the titles (en.wikipedia.org/{title_is_here}) from the result links. Side point: I tried Bing, but its result links mostly had one-word titles (Bing seems to really bias toward short queries in its ranking) and I wanted more diversity to test out tail queries.
  • Crawled cities (for the location-based queries) from http://www.census.gov/popest/cities/tables/SUB-EST2009-01.csv
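
The title-extraction step might look roughly like the sketch below. It assumes result links follow Wikipedia’s usual `/wiki/{title}` path layout (the shorthand above omits the `/wiki/` part), and `title_query_from_link` is a hypothetical name rather than the code actually used.

```python
from urllib.parse import unquote, urlparse

WIKI_HOST = "en.wikipedia.org"
ARTICLE_PREFIX = "/wiki/"  # assumption: standard Wikipedia article path


def title_query_from_link(link):
    """Turn a Wikipedia result link into a search query, or None if it is not one."""
    parsed = urlparse(link)
    if parsed.netloc != WIKI_HOST or not parsed.path.startswith(ARTICLE_PREFIX):
        return None
    slug = parsed.path[len(ARTICLE_PREFIX):]
    if not slug:
        return None
    # Titles use underscores for spaces and percent-encoding for other characters.
    return unquote(slug).replace("_", " ")


def query_length(query):
    """Number of whitespace-separated words in a query."""
    return len(query.split())
```

The same `query_length` helper is enough to bucket queries by word count for the tail analysis.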

Caveats

  • I ran these experiments at 2:45 a.m. PST on Monday. The location-based queries all relate to the U.S., so probably not many people were up at that time generating up-to-date information. The time lag stats could vary depending on when these experiments are run. I did however re-run the experiments in the late morning and didn’t see much difference in the timings.
  • I ran all queries through Google’s normal web search engine with ‘Latest’ on (in the left bar under Search Tools). These results are not exactly the same as those generated from the standalone Google Realtime Search portal, which seems to bias toward Tweets more, while the ‘Latest’ results seem to find middle ground between real-time Twitter results and web page results. I used ‘Latest’ because it seems like it would be the most popular gateway to Google’s Realtime search results.


Filed under Blog Stuff, Computer Science, Data Mining, Google, Information Retrieval, Research, Search, Social, Statistics, Twitter, Wikipedia

Some Stats about Twitter’s Content

Near the end of July, I crawled a sample of ~10M tweets. On my way over from Open Hack Day NYC yesterday I finally got some time to do some preliminary analysis of this data. Several posts have analyzed Twitter’s traffic stats [TechCrunch] [Mashable] [zooie], so I thought I’d focus more on the content here.

Duplication

By compressing the data and comparing the before and after sizes, one can get a pretty decent understanding of the duplication factor. To do this, I extracted just the raw text messages, sorted them, and then ran gzip over the sorted set.

Compression ratio

>>> 284023259 / 739273532  # compressed bytes / original bytes

0.38419238171778614

Typically, for text compression, gzip-like programs can achieve around 50% without the sort (and sorting typically helps), and here we get 38%. A standard text corpus consists of much larger document sizes, so it’s interesting to see a similar or larger duplication factor for tweets.
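
The sort-then-gzip measurement can be sketched in a few lines. This version works in memory on a list of message strings, which is an assumption made for brevity; a corpus of this size would more realistically be sorted and compressed on disk. `sorted_gzip_ratio` is a hypothetical name.

```python
import gzip


def sorted_gzip_ratio(messages):
    """Compressed size divided by raw size for the sorted, newline-joined messages."""
    raw = "\n".join(sorted(messages)).encode("utf-8")
    if not raw:
        raise ValueError("no text to compress")
    compressed = gzip.compress(raw)
    return len(compressed) / len(raw)
```

A lower ratio means more redundancy. Sorting places identical and similar messages next to each other, which helps gzip find them within its limited look-back window.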

We can dive even deeper into this area by analyzing the term overlap statistics to measure near duplication, or messages that aren’t necessarily identical but are close enough.

To do this, I first cleaned the text (removed stopwords, stemmed terms, normalized case). Interestingly, after cleaning the text, the average number of tokens for a message is just 6.28, or 2.5x the size of a standard web search query.

Then, I employed consistent term sampling to select N representatives for each cleaned message and coalesced the representatives together as a single key. By comparing the total number of unique keys to messages, one can infer the near duplication factor. Also, the higher the N, the higher the threshold is to match (so N >= 6, 6 being the average number of tokens per message, probably means that two messages that generate the same key are exact duplicates).
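
As a rough illustration of the idea, the sketch below picks each message’s N representatives as the N distinct tokens with the smallest hash values, so the same tokens are always chosen no matter where they appear in a message. The MD5-based ordering and the function names are assumptions made for this example; the exact sampling function used for the numbers in this post is not specified.

```python
import hashlib


def term_hash(term):
    """Stable 64-bit hash of a term (the same on every run and machine)."""
    return int.from_bytes(hashlib.md5(term.encode("utf-8")).digest()[:8], "big")


def sample_key(tokens, n):
    """Join the n distinct tokens with the smallest hashes into a single key."""
    chosen = sorted(set(tokens), key=term_hash)[:n]
    return " ".join(chosen)


def unique_key_coverage(cleaned_messages, n):
    """Unique keys divided by total messages; lower means more collisions."""
    if not cleaned_messages:
        raise ValueError("need at least one message")
    keys = {sample_key(tokens, n) for tokens in cleaned_messages}
    return len(keys) / len(cleaned_messages)
```

With this scheme, a message with fewer than N distinct tokens contributes all of its tokens to the key, so for large N two messages collide only when their cleaned token sets are identical.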

You’ll notice that coverage at N >= 6 converges around 84%, implying that after cleaning the text, 16% of the messages exactly match some other message. Additionally, when N = 2 (requiring 2 of 6 tokens, or 33% of the text on average, to match), about 44% of the messages collide with other messages in the corpus. At N = 2, matching often means the messages discuss the same general topic but aren’t true near duplicates.

N Term Samples    Unique Keys    Coverage
8                 8548695        0.8356
6                 8512672        0.8321
5                 8476590        0.8286
4                 8366391        0.8177
3                 8098400        0.7916
2                 5716566        0.5588
1                 1013783        0.0991

URLs

URLs are present in ~18% of the tweets

Of those, ~65% of the URLs are unique

70K Unique Domains covering 2M URLs
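
For anyone assembling a comparable set, URL stats like these could be computed roughly as follows. The regular expression, the trimming of trailing punctuation, and the stripping of a leading `www.` are all assumptions of this sketch, so its counts will not necessarily match the figures above.

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")


def extract_urls(text):
    """Find http(s) URLs in a tweet, trimming trailing punctuation."""
    return [u.rstrip(".,!?)\"'") for u in URL_RE.findall(text)]


def host_of(url):
    """Lowercased host name of a URL, without a leading 'www.'."""
    try:
        host = urlparse(url).hostname or ""
    except ValueError:  # malformed URL, e.g. unbalanced IPv6 brackets
        return ""
    return host[4:] if host.startswith("www.") else host


def url_stats(tweets):
    """Share of tweets with URLs, share of unique URLs, and the most common hosts."""
    all_urls = []
    tweets_with_url = 0
    for text in tweets:
        urls = extract_urls(text)
        if urls:
            tweets_with_url += 1
        all_urls.extend(urls)
    hosts = Counter(host_of(u) for u in all_urls)
    return {
        "tweet_share": tweets_with_url / len(tweets) if tweets else 0.0,
        "unique_url_share": len(set(all_urls)) / len(all_urls) if all_urls else 0.0,
        "unique_hosts": len(hosts),
        "top_hosts": [h for h, _ in hosts.most_common(10)],
    }
```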

Top Domains:

[‘bit.ly’, ‘tinyurl.com’, ‘twitpic.com’, ‘is.gd’, ‘myloc.me’, ‘ow.ly’, ‘ustre.am’, ‘cli.gs’, ‘tr.im’, ‘plurk.com’, ‘ff.im’, ‘tumblr.com’, ‘yfrog.com’, ‘140mafia.com’, ‘u.mavrev.com’, ‘twurl.nl’, ‘tweeterfollow.com’, ‘mypict.me’, ‘viagracan.com’, ‘vipfollowers.com’, ‘morefollowers.net’, ‘digg.com’, ‘tweeteradder.com’, ‘ping.fm’, ‘tiny.cc’, ‘followersnow.com’, ‘short.to’, ‘twit.ac’, ‘snipr.com’, ‘wefollow.com’, ‘tweet.sg’, ‘url4.eu’, ‘the-twitter-follow-train.info’, ‘fwix.com’, ‘budurl.com’, ‘su.pr’, ‘shar.es’, ‘tinychat.com’, ‘snipurl.com’, ‘loopt.us’, ‘migre.me’, ‘flic.kr’, ‘myspace.com’, ‘snurl.com’, ‘twitgoo.com’, ‘zshare.net’, ‘post.ly’, ‘bkite.com’, ‘yes.com’, ‘flickr.com’, ‘twitter.com’, ‘artistsforschapelle.com’, ‘140army.com’, ‘youtube.com’, ‘x.imeem.com’, ‘pic.gd’, ‘TwitterBackgrounds.com’, ‘raptr.com’, ‘twt.gs’, ‘twitthis.com’, ‘mobypicture.com’, ‘tobtr.com’, ‘ad.vu’, ‘sml.vg’, ‘rubyurl.com’, ‘tinylink.com’, ‘redirx.com’, ‘a2a.me’, ‘eCa.sh’, ‘vimeo.com’, ‘meadd.com’, ‘hotjobs.yahoo.com’, ‘doiop.com’, ‘myurl.in’, ‘urlpire.com’, ‘buzzup.com’, ‘freead.im’, ‘youradder.com’, ‘facebook.com’, ‘adf.ly’, ‘justin.tv’, ‘twitvid.com’, ‘adjix.com’, ‘twcauses.com’, ‘lkbk.nu’, ‘tlre.us’, ‘htxt.it’, ‘stickam.com’, ‘twubs.com’, ‘isy.gs’, ‘reverbnation.com’, ‘news.bbc.co.uk’, ‘sn.im’, ‘twibes.com’, ‘ustream.tv’, ‘trim.su’, ‘hashjobs.com’, ‘blogtv.com’, ‘jobs-cb.de’, ‘xsaimex.com’]

Retweets

~4% of messages are retweets

Replied @Users

~1M total replied-to users in this data set

37% of tweets contain ‘@x’ terms

Most Popular Replied-to Users (almost all celebrities):

[‘@mileycyrus’, ‘@jonasbrothers’, ‘@ddlovato’, ‘@mitchelmusso’, ‘@donniewahlberg’, ‘@souljaboytellem’, ‘@tommcfly’, ‘@addthis’, ‘@officialtila’, ‘@johncmayer’, ‘@shanedawson’, ‘@bowwow614’, ‘@jordanknight’, ‘@ryanseacrest’, ‘@perezhilton’, ‘@jonathanrknight’, ‘@petewentz’, ‘@tweetmeme’, ‘@adamlambert’, ‘@david_henrie’, ‘@dealsplus’, ‘@dwighthoward’, ‘@iamdiddy’, ‘@lancearmstrong’, ‘@songzyuuup’, ‘@imeem’, ‘@blakeshelton’, ‘@dannymcfly’, ‘@lilduval’, ‘@selenagomez’, ‘@markhoppus’, ‘@yelyahwilliams’, ‘@therealpickler’, ‘@stephenfry’, ‘@mrtweet.’, ‘@taylorswift13’, ‘@michaelsarver1’, ‘@davidarchie’, ‘@the_real_shaq’, ‘@tyrese4real’, ‘@britneyspears’, ‘@106andpark’, ‘@ashleytisdale’, ‘@mariahcarey’, ‘@kimkardashian’, ‘@wale’, ‘@mashable’, ‘@programapanico’, ‘@therealjordin’, ‘@listensto’, ‘@misskeribaby’, ‘@alyssa_milano’, ‘@alexalltimelow’, ‘@aplusk’, ‘@thisisdavina’, ‘@breakingnews:’, ‘@peterfacinelli’, ‘@truebloodhbo’, ‘@mgiraudofficial’, ‘@tonyspallelli’, ‘@mtv’, ‘@jackalltimelow’, ‘@dfizzy’, ‘@youngq’, ‘@tomfelton’, ‘@pooch_dog’, ‘@jonaskevin’, ‘@princesammie’, ‘@nkotb’, ‘@christianpior’, ‘@cthagod’, ‘@johnlloydtaylor’, ‘@neilhimself’, ‘@moontweet’, ‘@katyperry’, ‘@danilogentili’, ‘@mchammer’, ‘@rainnwilson’, ‘@joeymcintyre’, ‘@30secondstomars’, ‘@phillyd’, ‘@heidimontag’, ‘@mrpeterandre’, ‘@andyclemmensen’, ‘@crystalchappell’, ‘@kevindurant35’, ‘@huckluciano’, ‘@dannygokey’, ‘@jaketaustin’, ‘@revrunwisdom’, ‘@jamesmoran’, ‘@musewire’, ‘@dannywood’, ‘@nickiminaj’, ‘@akgovsarahpalin’, ‘@terrencej106’, ‘@mashable:’, ‘@drewryanscott’, ‘@mrtweet’, ‘@necolebitchie’, ‘@lilduval:’, ‘@willie_day26’, ‘@kirstiealley’, ‘@betthegame’, ‘@radiomsn’, ‘@alancarr’, ‘@rafinhabastos’, ‘@krisallen4real’, ‘@iamjericho’, ‘@breakingnews’, ‘@babygirlparis’, ‘@ladygaga’, ‘@chris_daughtry’, ‘@hypem’, ‘@danecook’, ‘@imcudi’, ‘@jeepersmedia’, ‘@buckhollywood’, ‘@kimmyt22’, ‘@giulianarancic’, ‘@chrisbrogan’, ‘@nasa’, ‘@addtoany’, ‘@nickcarter’, ‘@debbiefletcher’, ‘@marcoluque’, 
‘@shaundiviney’, ‘@ogochocinco’, ‘@twitter’, ‘@eddieizzard’, ‘@youngbillymays’, ‘@real_ron_artest’, ‘@pink’, ‘@laurenconrad’, ‘@rubarrichello’, ‘@ianjamespoulter’, ‘@liltwist’, ‘@teyanataylor’, ‘@dougiemcfly’, ‘@theellenshow’, ‘@robkardashian’, ‘@sherrieshepherd’, ‘@justinbieber’, ‘@paulaabdul’, ‘@jason_manford’, ‘@jaredleto’, ‘@tracecyrus’, ‘@itsonalexa’, ‘@ddlovato:’, ‘@khloekardashian’, ‘@revrunwisdom:’, ‘@solangeknowles’, ‘@allison4realzzz’, ‘@nickjonas’, ‘@reply’, ‘@anarbor’, ‘@donlemoncnn’, ‘@gfalcone601’, ‘@moonfrye’, ‘@symphnysldr’, ‘@iamspectacular’, ‘@honorsociety’, ‘@questlove’, ‘@guykawasaki’, ‘@dawnrichard’, ‘@_maxwell_’, ‘@somaya_reece’, ‘@mandyyjirouxx’, ‘@teemwilliams’, ‘@greggarbo’, ‘@pennjillette’, ‘@mikeyway’, ‘@matthardybrand’, ‘@iamjonwalker’, ‘@andyroddick’, ‘@kohnt01’, ‘@chris_gorham’, ‘@seankingston’, ‘@joshgroban’, ‘@mousebudden’, ‘@misskatieprice’, ‘@spencerpratt’, ‘@wilw’, ‘@jgshock’, ‘@swear_bot’, ‘@joelmadden’, ‘@techcrunch’, ‘@americanwomannn’, ‘@kelly__rowland’, ‘@mionzera’, ‘@astro_127’, ‘@_@’, ‘@spam’, ‘@sookiebontemps’, ‘@drakkardnoir’, ‘@noh8campaign’, ‘@kayako’, ‘@trvsbrkr’, ‘@qbkilla’, ‘@mw55’, ‘@guykawasaki:’, ‘@donttrythis’, ‘@cv31’, ‘@liljjdagreat’, ‘@tiamowry’, ‘@nickensimontwit’, ‘@holdemtalkradio’, ‘@bradiewebbstack’, ‘@nytimes’, ‘@riskybizness23’, ‘@radityadika’, ‘@adrienne_bailon’, ‘@riccklopes’, ‘@jessicasimpson’, ‘@sportsnation’, ‘@jasonbradbury’, ‘@huffingtonpost’, ‘@oceanup’, ‘@gilbirmingham’, ‘@iconic88’, ‘@the’, ‘@thebrandicyrus’, ‘@gordela’, ‘@thedebbyryan’, ‘@jessemccartney’, ‘@?’, ‘@caiquenogueira’, ‘@celsoportiolli’, ‘@shontelle_layne’, ‘@calvinharris’, ‘@chattyman’, ‘@ali_sweeney’, ‘@anamariecox’, ‘@joshthomas87’, ‘@emilyosment’, ‘@nasa:’, ‘@sevinnyne6126’, ‘@thebiggerlights’, ‘@theboygeorge’, ‘@jbarsodmg’, ‘@goldenorckus’, ‘@warrenwhitlock’, ‘@bobbyedner’, ‘@myfabolouslife’, ‘@descargaoficial’, ‘@ochonflcinco85’, ‘@ninabrown’, ‘@billycurrington’, ‘@oprah’, ‘@junior_lima’, ‘@asherroth’, ‘@starbucks’, 
‘@jason_pollock’, ‘@intanalwi’, ‘@harrislacewell’, ‘@serenajwilliams’, ‘@kevinruddpm’, ‘@bigbrotherhoh’, ‘@oliviamunn’, ‘@chamillionaire’, ‘@tamekaraymond’, ‘@teamwinnipeg’, ‘@littlefletcher’, ‘@piercethemind’, ‘@brookandthecity’, ‘@iranbaan:’, ‘@tonyrobbins’, ‘@maestro’, ‘@glennbeck’, ‘@1omarion’, ‘@nadhiyamali’, ‘@slimthugga’, ‘@jason_mraz’, ‘@profbrendi’, ‘@djaaries’, ‘@juanestwiter’, ‘@davegorman’, ‘@zackalltimelow’, ‘@mamajonas’, ‘@itschristablack’, ‘@skydiver’, ‘@gigva’, ‘@currensy_spitta’, ‘@paulwallbaby’, ‘@rpattzproject’, ‘@petewentz:’, ‘@rodrigovesgo’, ‘@drdrew’, ‘@sportsguy33’, ‘@cthagod:’, ‘@hollymadison123’, ‘@mjjnews’, ‘@itsbignicholas’, ‘@_supernatural_’, ‘@santoevandro’, ‘@demar_derozan’, ‘@marthastewart’, ‘@billganz62’, ‘@oodle’, ‘@davidleibrandt’]

Hashtags

~7% of messages contain hashtags

Total Unique Hashtags found: ~94k
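
Hashtag and @user counts of this kind can be gathered with a simple token scan. Entries such as ‘#lies:’ and ‘@mashable:’ in the top lists suggest that tokens were split on whitespace without stripping punctuation; that is an inference, and the sketch below simply adopts it as an assumption. The function names are hypothetical.

```python
from collections import Counter


def prefixed_tokens(text, prefix):
    """Lowercased whitespace tokens that start with `prefix` and have more after it."""
    return [t.lower() for t in text.split() if t.startswith(prefix) and len(t) > len(prefix)]


def tag_stats(tweets, prefix="#"):
    """Return (share of tweets containing such a token, Counter of the tokens)."""
    counts = Counter()
    tweets_with = 0
    for text in tweets:
        found = prefixed_tokens(text, prefix)
        if found:
            tweets_with += 1
        counts.update(found)
    share = tweets_with / len(tweets) if tweets else 0.0
    return share, counts
```

Calling `tag_stats(tweets, prefix="@")` gives the equivalent numbers for @user terms, and `counts.most_common(k)` yields a top-k list.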

Top Hashtags:

[‘#lies’, ‘#fb’, ‘#musicmonday’, ‘#truth’, ‘#iranelection’, ‘#moonfruit’, ‘#tendance’, ‘#jobs’, ‘#ihavetoadmit’, ‘#mariomarathon’, ‘#140mafia’, ‘#tcot’, ‘#zyngapirates’, ‘#followfriday’, ‘#spymaster’, ‘#ff’, ‘#1’, ‘#sotomayor’, ‘#turnon’, ‘#notagoodlook’, ‘#tweetmyjobs’, ‘#hiring:’, ‘#iran’, ‘#fun140’, ‘#jesus’, ‘#72b381.’, ‘#quote’, ‘#tinychat’, ‘#neda’, ‘#militarymon’, ‘#gr88’, ‘#trueblood’, ‘#fail’, ‘#news’, ‘#140army’, ‘#livestrong’, ‘#noh8’, ‘#wpc09’, ‘#music’, ‘#turnoff’, ‘#unacceptable’, ‘#twables’, ‘#masterchef’, ‘#noh84kradison’, ‘#writechat’, ‘#job’, ‘#squarespace’, ‘#michaeljackson’, ‘#2’, ‘#nothingpersonal’, ‘#iphone’, ‘#ala2009’, ‘#mj’, ‘#tdf’, ‘#blogtalkradio’, ‘#mlb’, ‘#1stdraftmovielines’, ‘#p2’, ‘#secretagent’, ‘#tlot’, ‘#72b381’, ‘#honduras’, ‘#twitter’, ‘#jtv’, ‘#tehran’, ‘#gorillapenis’, ‘#porn’, ‘#bb11’, ‘#sotoshow’, ‘#brazillovesatl’, ‘#google’, ‘#oneandother’, ‘#bb10’, ‘#chucknorris’, ‘#cmonbrazil’, ‘#agendasource’, ‘#travel’, ‘#ashes’, ‘#dumbledore’, ‘#freeschapelle’, ‘#tl’, ‘#dealsplus’, ‘#nsfw’, ‘#entourage’, ‘#tech’, ‘#hottest100’, ‘#3693dh…’, ‘#torchwood’, ‘#design’, ‘#teaparty’, ‘#love’, ‘#dontyouhate’, ‘#mileycyrus’, ‘#sgp’, ‘#harrypottersequels’, ‘#peteandinvisiblechildren’, ‘#stopretweets’, ‘#tscc’, ‘#wimbledon’, ‘#hive’, ‘#cubs’, ‘#3’, ‘#redsox’, ‘#photography’, ‘#voss’, ‘#snods’, ‘#lol’, ‘#socialmedia’, ‘#gop’, ‘#health’, ‘#esriuc’, ‘#green’, ‘#follow’, ‘#echo!’, ‘#obama’, ‘#digg’, ‘#shazam’, ‘#hhrs’, ‘#video’, ‘#moonfruit.’, ‘#swineflu’, ‘#politics’, ‘#ebuyer683’, ‘#umad’, ‘#quizdostandup’, ‘#thankyoumichael’, ‘#blogchat’, ‘#wordpress’, ‘#3693dh’, ‘#haiku’, ‘#ttparty’, ‘#lastfm:’, ‘#healthcare’, ‘#hcr’, ‘#ecgc’, ‘#seo’, ‘#apple’, ‘#chuck’, ‘#wine’, ‘#sammie’, ‘#h1n1’, ‘#marketing’, ‘#twitition’, ‘#happybirthdaymitchel18’, ‘#cnn’, ‘#lie’, ‘#rt:’, ‘#art’, ‘#nasa’, ‘#blog’, ‘#quotes’, ‘#bruno’, ‘#business’, ‘#palin’, ‘#mw2’, ‘#hcsm’, ‘#harrypotter’, ‘#4’, ‘#lastfm’, ‘#askclegg’, ‘#photo’, ‘#jobfeedr’, ‘#lgbt’, ‘#lies:’, 
‘#ihavetoadmit.i’, ‘#jamlegend,’, ‘#truthbetold’, ‘#mcfly’, ‘#microsoft’, ‘#fashion’, ‘#tweetphoto’, ‘#ebuyer167201’, ‘#noh84adison’, ‘#5’, ‘#mets’, ‘#china’, ‘#bigprize’, ‘#whythehell’, ‘#money’, ‘#sophiasheart’, ‘#finance’, ‘#michael’, ‘#f1’, ‘#adamlambert100k’, ‘#web’, ‘#urwashed’, ‘#moonfruit!’, ‘#1:’, ‘#kayako’, ‘#lies.’, ‘#thankyouaaron’, ‘#food’, ‘#wow’, ‘#moonfruit,’, ‘#facebook’, ‘#ebuyer291’, ‘#ecomonday’, ‘#ihave’, ‘#happybdaydenise’, ‘#postcrossing’, ‘#ichc’, ‘#912’, ‘#demilovatolive’, ‘#gijoemoviefan’, ‘#funny’, ‘#media’, ‘#meowmonday’, ‘#israel’, ‘#blogger’, ‘#forasarney’, ‘#tv’, ‘#topgear’, ‘#chrisisadouche’, ‘#stlcards’, ‘#wec09’, ‘#forex’, ‘#aots1000’, ‘#celebrity’, ‘#dwarffilmtitles’, ‘#6’, ‘#yeg’, ‘#slaughterhouse’, ‘#nfl’, ‘#photog’, ‘#ny’, ‘#firstdraftmovies’, ‘#ufc’, ‘#reddit’, ‘#free’, ‘#iwish’, ‘#etsy’, ‘#rulez’, ‘#sports’, ‘#icmillion’, ‘#mmot’, ‘#webdesign’, ‘#deals’, ‘#moonfruit?’, ‘#pawpawty’, ‘#twitterfahndung’, ‘#billymaystribute’, ‘#sytycd’, ‘#runkeeper’, ‘#scotus’, ‘#yoconfieso’, ‘#mariomarathon,’, ‘#musicmondays’, ‘#lies,’, ‘#findbob’, ‘#realestate’, ‘#sohrab’, ‘#sales’, ‘#metal’, ‘#runescape’, ‘#hypem’, ‘#threadless’, ‘#gay’, ‘#isyouserious’, ‘#hollywood,’, ‘#2:’, ‘#ca,’, ‘#golf’, ‘#diadorock’, ‘#newyork,’, ‘#meteor’, ‘#dailyquestion’, ‘#photoshop’, ‘#saveiantojones’, ‘#musicmonday:’, ‘#rock’, ‘#sex’, ‘#mlbfutures’, ‘#ilove’, ‘#mikemozart’, ‘#nascar’, ‘#indico’, ‘#crossfitgames’, ‘#gratitude’, ‘#quote:’, ‘#creativetechs’, ‘#truth:’, ‘#sharepoint’, ‘#mkt’, ‘#why’, ‘#bigbrother’, ‘#tam7’, ‘#ihate’, ‘#futureruby’, ‘#slickrick’, ‘#105.3’, ‘#youareinatl’, ‘#vegan’, ‘#dontletmefindout’, ‘#imustadmit’, ‘#7’, ‘#twitterafterdark’, ‘#sunnyfacts’, ‘#gilad’, ‘#japan’, ‘#iremember’, ‘#97.3’, ‘#puffdaddy’, ‘#blogher’, ‘#ade2009’, ‘#aaliyah’, ‘#alfredosms’, ‘#95.1’, ‘#truth,’, ‘#twine’, ‘#hiring’]

Questions

It’s hard to infer exactly whether a message is a question, so I ran three different filters:

5W’s, H, ? present ANYWHERE in tweet:

0.102789281948 or 10%

5W’s, H first token or ? last token:

0.0238229662219 or 2%

Just ? ANYWHERE in tweet:

0.0040984928533 or 0.4%
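
The three filters could be written along these lines. This is a sketch under assumptions: the question words are taken to be who, what, when, where, why and how, tokens are split on whitespace, and surrounding punctuation is stripped before comparing. The tokenization behind the numbers above is not stated, so this code is not expected to reproduce them exactly.

```python
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how"}
PUNCT = ".,!?:;\"'"


def _tokens(text):
    """Lowercased whitespace tokens."""
    return text.lower().split()


def marker_anywhere(text):
    """Filter 1: a question word or a '?' appears anywhere in the tweet."""
    return "?" in text or any(t.strip(PUNCT) in QUESTION_WORDS for t in _tokens(text))


def marker_at_edges(text):
    """Filter 2: first token is a question word, or the last token ends with '?'."""
    toks = _tokens(text)
    if not toks:
        return False
    return toks[0].strip(PUNCT) in QUESTION_WORDS or toks[-1].endswith("?")


def question_mark_anywhere(text):
    """Filter 3: a '?' appears anywhere in the tweet."""
    return "?" in text


def share(tweets, predicate):
    """Fraction of tweets for which `predicate` is true."""
    return sum(1 for t in tweets if predicate(t)) / len(tweets) if tweets else 0.0
```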

Users

Discovered ~2M unique users

Top Sending Users (many bots):

[‘followermonitor’, ‘Tweet_Words’, ‘currentcet’, ‘currentutc’, ‘whattimeisitnow’, ‘ItIsNow’, ‘ThinkingStiff’, ‘otvrecorder’, ‘delicious50’, ‘Porngus’, ‘craigslistjobs’, ‘GorPen’, ‘hashjobs’, ‘TransAlchemy2’, ‘bot_theta’, ‘CHRISVOSS’, ‘bot_iota’, ‘bot_kappa’, ‘TIPAS’, ‘VeolaJBanner’, ‘StacyDWatson’, ‘LMAObot’, ‘SarahJSlonecker’, ‘AllisonMRussell’, ‘bot_eta’, ‘SandraHOakley’, ‘bot_psi’, ‘bot_tau’, ‘LoreleiRMercer’, ‘bot_zeta’, ‘bot_gamma’, ‘bot_sigma’, ‘bot_lambda’, ‘bot_pi’, ‘bot_epsilon’, ‘bot_nu’, ‘bot_rho’, ‘bot_omicron’, ‘bot_khi’, ‘LindaTYoung’, ‘mensrightsindia’, ‘bot_omega’, ‘bot_ksi’, ‘bot_delta’, ‘bot_alpha’, ‘bot_phi’, ‘CindaDJenkins’, ‘bot_mu’, ‘ImogeneDPetit’, ‘bot_upsilon’, ‘OPENLIST_CA’, ‘openlist’, ‘isygs’, ‘dq_jumon’, ‘gamingscoop’, ‘MildredSLogan’, ‘ObiWanKenobi_’, ‘pulseSearch’, ‘MaryEVo’, ‘ImeldaGMcward’, ‘MaryJNewman’, ‘SharonTForde’, ‘LoriJCornelius’, ‘BrandyWPulliam’, ‘RhondaTLopez’, ‘AprilKOropeza’, ‘CarolETrotman’, ‘SusanATouvell’, ‘dinoperna’, ‘buzzurls’, ‘_Freelance_’, ‘DrSnooty’, ‘illstreet’, ‘bibliotaph_eyes’, ‘loc4lhost’, ‘bsiyo’, ‘BOTHOUSE’, ‘post_ads’, ‘qazkm’, ‘frugaldonkey’, ‘free_post’, ‘groovera’, ‘wonkawonkawonka’, ‘ForksGirlBella’, ‘casinopokera’, ‘dermdirectoryny’, ‘Yoowalk_chat’, ‘mstehr’, ‘hashgoogle’, ‘perry1949’, ‘ensiz_news’, ‘Bezplatno_net’, ‘timesmirror’, ‘work_freelance’, ‘cockbot’, ‘pdurham’, ‘bombtter_raw’, ‘ocha1’, ‘AlairAneko24’, ‘HaiIAmDelicious’, ‘Freshestjobs’, ‘fast_followers’, ‘LeadsForFree’, ‘RideOfYourLife’, ‘AlastairBotan30’, ‘helpmefast25’, ‘TheMLMWizard’, ‘uitrukken’, ‘adoptedALICE’, ‘TKATI’, ‘ezadsncash’, ‘tweetshelp’, ‘LAmetro_traffic’, ‘thinkpozzitive’, ‘StarrNeishaa’, ‘AldenCho36’, ‘JobHits’, ‘wootboot’, ‘smacula’, ‘faithclubdotnet’, ‘DmitriyVoronov’, ‘brownthumbgirl’, ‘NYCjobfeed’, ‘hfradiospacewx’, ‘FakeeKristenn’, ‘MLBDAILYTIMES’, ‘wildingp’, ‘JacksonsReview’, ‘EarthTimesPR’, ‘friedretweet’, ‘Wealthy23’, ‘RokpoolFM’, ‘HDOLLAZ’, ‘_MrSpacely’, ‘Bestdocnyc’, ‘Rabidgun’, ‘flygatwick’, ‘live_china’, 
‘friendlinks’, ‘retweetinator’, ‘iamamro’, ‘thayferreira’, ‘AldisDai39’, ‘AndersHana60’, ‘nonstopNEWS’, ‘VivaLaCash’, ‘TravelNewsFeeds’, ‘vuelosplus’, ‘threeporcupines’, ‘DemiAuzziefan’, ‘worldofprint’, ‘KevinEdwardsJr’, ‘REDDITSPAMMOR’, ‘NatValentine’, ‘ChanelLebrun’, ‘nowbot’, ‘hollyswansonUK’, ‘youngrhome’, ‘M_Abricot’, ‘thefakemandyv’, ‘scrapbookingpas’, ‘Naughtytimes’, ‘Opcode1300_bot’, ‘tellsecret’, ‘tboogie937’, ‘Climber_IT’, ‘comlist’, ‘with_a_smile’, ‘USN_retired’, ‘Climber_EngJobs’, ‘Climber_Finance’, ‘Climber_HRJobs’, ‘intanalwi’, ‘Climber_Sales’, ‘nadhiyamali’, ‘wonderfulquotes’, ‘MRAustria’, ‘O2Q’, ‘GL0’, ‘SookieBonTemps’, ‘MRSchweiz’, ‘latinasabor’, ‘nineleal’, ‘casservice’, ‘AltonGin54’, ‘KulerFeed’, ‘_cesaum’, ‘HFMONAIR’, ‘DeeOnDreeYah’, ‘rockstalgica’, ‘iamword’, ‘rpattzproject’, ‘madblackcatcom’, ‘ftfradio’, ‘marciomtc’, ‘SocialNetCircus’, ‘AnotherYearOver’, ‘ichig’, ‘tcikcik’, ‘HelenaMarie210’, ‘mrbax0’, ‘SWBot’, ‘DayTrends’, ‘_Embry_Call_’, ‘eProducts24’, ‘The_Sims_3’, ‘tom_ssa’, ‘woxy_vintage’, ‘urbanmusic2000’, ‘dopeguhxfresh’, ‘erections’, ‘DudeBroChill’, ‘lookingformoney’, ‘drnschneider’, ‘MosesMaimonides’, ’92Blues’, ‘elarmelar’, ‘rock937fm’, ‘sonicfm’, ‘erikadotnet’, ‘sky0311’, ‘weqx’, ‘brandamc’, ‘Hot106’, ‘woxy_live’, ‘ksopthecowboy’, ‘vixalius’, ‘cogourl’, ‘Cashintoday’, ‘Andrewdaflirt’, ‘oodle’, ‘mkephart25’, ‘doomed’, ‘spotifyuri’, ‘mangelat’, ‘Cody_K’, ‘swayswaystacey’, ‘KLLY953’, ‘onlaa’, ‘Ginger_Swan’, ‘Call_Embry’, ‘conservatweet’, ‘weerinlelystad’, ‘ruhanirabin’, ‘tmgadops’, ‘wakemeupinside1’, ‘horaoficial’, ‘xstex’, ‘franzidee’, ‘tommytrc’, ‘khopmusic’, ‘tez19’, ‘GaryGotnought’, ‘UnemployKiller’, ‘felloff’, ‘Kalediscope’, ‘TheRealSherina’, ‘jasonsfreestuff’, ‘johnkennick’, ‘sel_gomezx3’, ‘OE3’, ‘AddisonMontg’, ‘_rosieCAKES’, ‘neownblog’, ‘PrinceP23’, ‘ontd_fluffy’, ‘USofAl’, ‘Kacizzle88’, ‘somalush’, ‘FrankieNichelle’, ‘jiva_music’, ‘itz_cookie’, ‘soundOfTheTone’, ‘knowheremom’, ‘Jayme1988’, ‘TrafficPilot’, ‘tweetalot’, 
‘TheStation1610’, ‘lasvegasdivorce’, ‘1000_LINKS_NOW2’, ‘KeepOnTweeting’, ‘uFreelance’, ‘ChocoKouture’, ‘Magic983’, ‘SnarkySharky’, ‘agthekid’, ‘cashinnow’, ‘jamokie’, ‘jessicastanely’, ‘Q103Albany’, ‘GPGTwit’, ‘xAmberNicholex’, ‘wjtlplaylist’, ‘sjAimee’, ‘chrisduhhh’, ‘failbus’, ‘1stwave’, ‘RichardBejah’, ‘nyanko_love’]

Web Queries Overlap

How much overlap is there between tweets and trending web search queries?

I took the top trending queries during the days of my Twitter crawl from Google Trends, then query-expanded each trending query until its length was 6 tokens so as to equalize the average lengths. Then, I simply counted how many tweets match at least 2 (cleaned) tokens of any of these query-expanded trends:

0.0185654981775 or 2%
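
The matching step amounts to a set intersection per tweet. The sketch below assumes tweets and expanded trends are already available as lists of cleaned tokens. It does not cover the query expansion itself, since that method is not described here, and the function names are hypothetical.

```python
def matches_any_trend(tweet_tokens, trend_sets, min_shared=2):
    """True if the tweet shares at least `min_shared` distinct tokens with some trend."""
    tweet_set = set(tweet_tokens)
    return any(len(tweet_set & trend) >= min_shared for trend in trend_sets)


def trend_overlap_share(cleaned_tweets, expanded_trends, min_shared=2):
    """Fraction of tweets that match at least one expanded trend."""
    if not cleaned_tweets:
        return 0.0
    trend_sets = [set(tokens) for tokens in expanded_trends]
    hits = sum(1 for t in cleaned_tweets if matches_any_trend(t, trend_sets, min_shared))
    return hits / len(cleaned_tweets)
```

Comparing every tweet against every trend is fine for a short trend list; for a much larger one, an inverted index from token to trend would avoid the full scan.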

That’s it for now. I have some more stats but need a bit more time to clean those up before publishing here.

Notes

Can’t distribute my data set unfortunately, but it shouldn’t take too long to assemble a comparable set via Twitter’s spritzer feed – that’ll probably be more useful as it’ll be more up-to-date than the one I analyzed here. Feel free to pull my stats off if you find them useful (top hashtags and users are in JSON format).


Filed under Data Mining, Research, Search, Social, Statistics, Trends, Twitter

Delicious.com Gets Fresh

Today we have officially released an experimental Fresh tab on the delicious.com page. Learn more about it here on the delicious blog.

I won’t rehash too much of the delicious blog post as that describes the motivation and idea in detail, but the basic idea was to advance and apply the TweetNews model to the latest stream of delicious bookmarks. The result is what we feel to be a pretty relevant and fresh (updates every minute or so) homepage. Please check it out and bookmark it (no pun intended). This is just a simple start toward better surfacing of content on delicious – expect more updates soon.

delicious also greatly advanced its search experience and sharing options in this release. You can learn more about it from the release posts here and soon here.


Filed under Boss, delicious, Non-Technical-Read, Open, Research, Social, Twitter, Uncategorized, Yahoo