Near the end of July, I crawled a sample of ~10M tweets. On my way over from Open Hack Day NYC yesterday I finally got some time to do some preliminary analysis of this data. Several posts have analyzed Twitter’s traffic stats [TechCrunch] [Mashable] [zooie], so I thought I’d focus more on the content here.

### Duplication

By compressing the data and comparing the before and after sizes, one can get a pretty decent understanding of the duplication factor. To do this, I extracted just the raw text messages, sorted them, and then ran gzip over the sorted set.

Compression ratio (compressed bytes / raw bytes):

>>> 284023259 / 739273532
0.38419238171778614

Typically, gzip-like programs achieve around 50% compression on text even without sorting (and sorting generally helps); here we get 38%. A standard text corpus consists of much larger documents, so it's interesting to see a similar or larger duplication factor for tweets.
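The measurement itself is simple. Here's a minimal sketch (the helper name is mine, not from the original pipeline; it assumes one message per line):

```python
import gzip

def compression_ratio(messages):
    # Sort the raw texts so duplicates become adjacent, then gzip the
    # concatenation and compare compressed size to raw size.
    raw = "\n".join(sorted(messages)).encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)
```

A lower ratio means gzip found more redundancy, which here serves as a rough proxy for duplication in the corpus.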

We can dig deeper by analyzing term-overlap statistics to measure near duplication: messages that aren't necessarily identical but are close enough.

To do this, I first cleaned the text (removed stopwords, stemmed terms, normalized case). Interestingly, after cleaning, the average message is just 6.28 tokens, or about 2.5x the size of a standard web search query.
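A minimal cleaning pass along those lines might look like this (the stopword list and the crude suffix stripper are illustrative stand-ins; the post doesn't say which stemmer or stopword list was actually used):

```python
import re

# Tiny illustrative stopword list (assumption, not the original one).
STOPWORDS = {"the", "a", "an", "is", "are", "and", "or", "to", "of", "in", "it"}

def light_stem(token):
    # Toy suffix stripping, standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def clean(text):
    # Normalize case, tokenize, drop stopwords, stem what's left.
    tokens = re.findall(r"[a-z0-9@#']+", text.lower())
    return [light_stem(t) for t in tokens if t not in STOPWORDS]
```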

Then, I employed consistent term sampling to select N representatives for each cleaned message and coalesced the representatives into a single key. Comparing the total number of unique keys to the number of messages lets one infer the near-duplication factor. The higher the N, the stricter the match threshold (so at N >= 6, with 6 being the average number of tokens per message, two messages that generate the same key are probably exact duplicates).
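Here's one way the key construction could work. This is a sketch using a stable hash to pick representatives; the post doesn't specify the exact scheme beyond "consistent term sampling", so the details below are my assumptions:

```python
import hashlib

def message_key(tokens, n):
    # Rank tokens by a stable hash and keep the n smallest as the
    # message's representatives; the sorted sample becomes the key.
    sample = sorted(set(tokens),
                    key=lambda t: hashlib.md5(t.encode("utf-8")).hexdigest())[:n]
    return "|".join(sorted(sample))

def coverage(messages, n):
    # Unique keys / total messages: lower coverage => more (near) duplicates.
    keys = {message_key(tokens, n) for tokens in messages}
    return len(keys) / len(messages)
```

Because the sampling is deterministic per token, two messages that share enough tokens collapse to the same key regardless of token order.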

You’ll notice N >= 6 converges around 84%, implying that after cleaning the text, about 16% of messages exactly match some other message. Additionally, at N = 2 (requiring 2 of ~6 tokens, or ~33% of the text on average, to match), roughly 45% of the messages collide with other messages in the corpus. At N = 2, matching often means the messages discuss the same general topic, but aren’t close near duplicates.

| N (term samples) | Unique Keys | Coverage |
|---|---|---|
| 8 | 8548695 | 0.8356 |
| 6 | 8512672 | 0.8321 |
| 5 | 8476590 | 0.8286 |
| 4 | 8366391 | 0.8177 |
| 3 | 8098400 | 0.7916 |
| 2 | 5716566 | 0.5588 |
| 1 | 1013783 | 0.0991 |

### URLs

URLs are present in ~18% of the tweets

Of those, ~65% of the URLs are unique

70K unique domains covering 2M URLs

Top Domains:
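Extracting domains for stats like these is straightforward. A sketch (the regex is a simplification and will miss some edge cases, e.g. trailing punctuation glued to a URL):

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")

def extract_domains(text):
    # Pull URLs out of the tweet text and reduce each to its host,
    # case-folded so TinyURL.com and tinyurl.com count as one domain.
    return [urlparse(u).netloc.lower() for u in URL_RE.findall(text)]
```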

### Retweets

~4% of messages are retweets
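Detecting retweets in a corpus from this era mostly means looking for the "RT @user" convention. A sketch (my heuristic, not necessarily the one used for the 4% figure; variants like "via @user" aren't covered):

```python
import re

# Match "rt @someone" at the start of the text or after whitespace.
RT_RE = re.compile(r"(?:^|\s)rt\s+@\w+", re.IGNORECASE)

def is_retweet(text):
    return bool(RT_RE.search(text))
```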

### Replied @Users

~1M total replied-to users in this data set

37% of tweets contain ‘@x’ terms

Most Popular Replied-to Users (almost all celebrities):

### Hashtags

~7% of messages contain hashtags

Total Unique Hashtags found: ~94k

Top Hashtags:
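Counting hashtags only takes a small extraction step. A sketch (assuming tags are case-folded before counting, so #NYC and #nyc collapse to one tag):

```python
import re

HASHTAG_RE = re.compile(r"#\w+")

def hashtags(text):
    # Find every #tag in the text and normalize case for counting.
    return [h.lower() for h in HASHTAG_RE.findall(text)]
```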

### Questions

Hard to infer exactly whether a message is a question or not, so I ran a couple of different filters:

5W’s, H, ? present ANYWHERE in tweet:

0.102789281948 or 10%

5W’s, H first token or ? last token:

0.0238229662219 or 2%

Just ? ANYWHERE in tweet:

0.0040984928533 or 0.4%
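The three filters above can be sketched as follows (my reconstruction of the heuristics as described; the exact tokenization used for the reported percentages isn't specified):

```python
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how"}

def tokens(text):
    return text.lower().split()

def qword_or_mark_anywhere(text):
    # Filter 1: a 5W/H word or '?' appearing anywhere in the tweet.
    return "?" in text or any(t.strip("?!.,") in QUESTION_WORDS
                              for t in tokens(text))

def qword_first_or_mark_last(text):
    # Filter 2: a 5W/H word as the first token, or '?' ending the tweet.
    toks = tokens(text)
    return bool(toks) and (toks[0] in QUESTION_WORDS
                           or text.rstrip().endswith("?"))

def mark_anywhere(text):
    # Filter 3: just a '?' anywhere.
    return "?" in text
```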

### Users

Discovered ~2M unique users

Top Sending Users (many bots):

### Web Queries Overlap

How much overlap is there between tweets and trending web search queries?

I took the top trending queries during the days of my twitter crawl from Google Trends, then query expanded each trending query until the length was 6 tokens so as to equalize the average lengths. Then, I simply counted how many tweets match at least 2 (cleaned) tokens of any of these query-expanded trends:

0.0185654981775 or 2%
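The counting step can be sketched like this (it assumes the trending queries have already been expanded to ~6 tokens and that both sides went through the same cleaning pipeline; the function name is mine):

```python
def overlap_fraction(tweets, expanded_queries, min_match=2):
    # Fraction of tweets sharing at least `min_match` cleaned tokens
    # with any of the expanded trending queries.
    query_sets = [set(q) for q in expanded_queries]
    hits = sum(
        1
        for tweet in tweets
        if any(len(set(tweet) & q) >= min_match for q in query_sets)
    )
    return hits / len(tweets)
```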

That’s it for now. I have some more stats but need a bit more time to clean those up before publishing here.

Notes

Can’t distribute my data set unfortunately, but it shouldn’t take too long to assemble a comparable set via Twitter’s spritzer feed – that’ll probably be more useful as it’ll be more up-to-date than the one I analyzed here. Feel free to pull my stats if you find them useful (top hashtags and users are in JSON format).

Filed under Data Mining, Research, Search, Social, Statistics, Trends, Twitter

1. Interesting stats, especially the compression rate analysis. Would have guessed that Twitter being “quite” international, the corpus would have been much larger and thus compression not as high as the results you’ve come up with.
I hope that one day, Twitter will provide us with some real and exclusive stats about their data. That’d be interesting.

2. Hi Vik,

Great post and very insightful even though I admit I didn’t quite follow all the math.

It’s the type of stuff I’d expect Twitter to be publishing themselves more often.

Cheers,
Mike
http://Semantinet.com/publishers

3. Do you have any script for scraping (indexing) Tweets? I am in need of one to use AI in it..

4. Babak

Hi Vik,

Very interesting stats, I would like to know how you generate a single key for each message by coalescing its representatives. How can I use keys to find exact duplicate or almost duplicates? Also do you know a source that can provide stemmed terms?

Example:

Wild cat
Cat behaves wildly
The cat is wild
Is she like a wild cat?
wildcats.com
Wild animals like to eat cat
She looks wild like her cat
John saved the cat from the wild animal

According to your algorithm all above should produce a unique key, right? and how your data analysis work in this example?