Calculate the Sales Performance of Any Company in Two Minutes Flat

Originally published on LATKA’s SaaS.

tl;dr: Run a few searches (patterns provided) on LinkedIn and Google to determine the current reps and churned reps per year for any given company. Divide churned reps per year by the current number of reps to compute the percentage of the current sales team that will churn in one year’s time – producing the Sales Team Churn (STC) metric. You can compute STC for any company (including privately held ones) and compare it to other companies’ STCs (see the benchmarks table covering several key SaaS companies below).

To compute this sales performance metric, you need to run three simple searches.

First, search for the company on LinkedIn – “in People” search:

[Screenshot: initial LinkedIn People search for the company]

Then click “All Filters” near the top right of the search results page, check the company under “Current companies”, and in the “Title” field (near the bottom of the form), copy and paste the following string:

“sales development” OR “SDR” OR “account executive” OR “sales executive” OR “account manager”

Then click “Apply”.

[Screenshot: LinkedIn search filters for current company and title]

Near the top left of the search results page is the number of results (highlighted in blue in the image below). Save this figure, which represents the current number of sales related heads at the company.

[Screenshot: LinkedIn result count for current sales heads]

Then click “All Filters” again. Uncheck the company under “Current companies”, and check the company under “Past companies”. In the “Company” field box, enter “-” then the company name (in quotes if company name is more than one token).

[Screenshot: LinkedIn search filters for past company]

Click “Apply”, and save the number of results, which represents the number of sales related heads that no longer work at the company.

[Screenshot: LinkedIn result count for past sales heads]

Then search the company name padded with “founded” on Google, and find the date the company was started.

[Screenshot: Google search result showing the company’s founding date]

These three searches give us current sales heads, past sales heads, and the founding date of the company.

With these three values, we can calculate the following:

Churned Sales Heads Per Year = Past Sales Heads / Number of Years Since Founded

Sales Team Churn (STC) = Churned Sales Heads Per Year / Current Sales Heads

This metric represents how much of the current sales team will churn in one year (so the lower the number, the better).

Let’s plug in the example illustrated in the screenshots above:

Zendesk, founded in 2007, currently has 476 sales related heads and has lost 146 heads.

Churn per year = 146 / (2019 – 2007) = ~12

STC = 12 / 476 = ~2.5%
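The two formulas above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `sales_team_churn` and its parameter names are hypothetical, and the "as of" year (2019 in the Zendesk example) is passed in explicitly rather than read from the clock.

```python
def sales_team_churn(current_heads, past_heads, founded_year, as_of_year):
    """Return Sales Team Churn (STC) as a fraction, e.g. 0.025 for 2.5%.

    All names here are illustrative; the inputs are the three values
    gathered from the LinkedIn and Google searches.
    """
    years = as_of_year - founded_year
    if years <= 0 or current_heads <= 0:
        raise ValueError("need at least one full year and one current head")
    # Churned Sales Heads Per Year = Past Sales Heads / Years Since Founded
    churned_per_year = past_heads / years
    # STC = Churned Sales Heads Per Year / Current Sales Heads
    return churned_per_year / current_heads


# Zendesk figures from the example above
stc = sales_team_churn(current_heads=476, past_heads=146,
                       founded_year=2007, as_of_year=2019)
print(f"{stc:.1%}")  # prints 2.6%
```

Without rounding churn per year down to 12 first, the Zendesk figures come out to roughly 2.6% rather than ~2.5%; the difference is purely rounding.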

So, what’s a good STC score? Here are the metric values for various enterprise companies (full table here):

[Table image: STC benchmarks for enterprise companies]

The key columns are the last three. The lower the STC, the better. I’ve been capturing monthly snapshots of STC values for this basket of companies – the sparkline and percentage change are provided in the last two columns. Ideally, STC is below 5% and is steady or going down over time (in green if going down – otherwise red). See full table here.

The average of this basket of companies is 5.9% – so I would recommend targeting under 5% in general. This sounds low, but that’s because churn per year is calculated over all time – since the incorporation date. This is because I was unable to filter LinkedIn search results to a select period of time (like over the past year). In the early years, sales and corresponding sales churn will be low or nonexistent, which heavily down-weights this value. As companies age, the number of years increases and the number of current sales reps increases, both of which can lower this score. This is how even a mature company intentionally churning out 10-20% of its sales team annually (to remove bottom performers and raise the bar for all reps) can still score an STC below 5%. The key is to compare STC with other companies at similar stages rather than directly comparing it with internal annual attrition numbers, given the differences in calculation methods and assumptions.

There are many interesting insights to glean from this table. For example, there are several companies with < 1% STC, including LaunchDarkly, Zoom (although growing), Twilio and GitLab – all of which have self-service trial flows.

There are also companies with > 10% STC percentages (to reiterate – this means more than 10% of their sales team will attrit in a year), including Domo, Dropbox, Gainsight and Zuora.

There are interesting competitive benchmarks as well – for example, Intercom’s STC is 5.3% vs Drift’s 10.27% – roughly half of Drift’s sales churn.

I’ve been snapshotting this metric for several months, as it’s also important to look at how STC is changing over time (see last column above – is the sales health getting better or worse?). There are companies on this list who have consistently increased (Gainsight, Zuora) or decreased (Anaplan, Domo) their STC.

So, why the focus on sales team size for assessing the sales performance of a company?

When I was running Infer, I recall my syncs with Aaron Levie (CEO / Co-Founder of Box, and one of our angel investors at the time), and the first question he would always ask was:

“How many salespeople do you have now?”

This is a really great question for assessing a business quickly. The more reps you have, the more deals you can close. The more reps you have, the more market demand you have. The more reps you have, the more you’re spending to drive growth. The more reps you have, the better your hiring process, sales leadership and culture are for attracting talent.

The twist with STC is that we’re not just looking at the number of present salespeople, but also factoring in the attrition rate. So, the fewer reps churned, the more reps that are hitting their quotas.

C-level executives focus on key business metrics such as gross or net revenue retention. High churn means there’s a leaky bucket, which can sink even a high growth new ARR business.

This applies not just to customers, but to sales reps too – often overlooked compared to customer churn. If a company is losing a good chunk of their sales team each year, then the company is losing the ability to generate and close revenue making opportunities, and has to spend to hire new reps – and burn valuable time ramping them.

In general, sales reps face higher risks of attrition than those in other functions. Their goals are measurable, and if they miss, they’re fired or leave – and the best performing reps can receive promotions and make more money elsewhere. But even if higher attrition is expected, what is the healthy, right amount of attrition for any given company to experience?

Losing reps is very much a leaky bucket just like customer churn, and deserves metrics and magic numbers to abide by – hence STC and the 5% target.

The STC metric has several nice properties:

It’s accessible, and can be computed for any company (unlike revenue, which is hard to reliably discern for private companies).
Anyone can quickly and easily derive this metric with a free LinkedIn account.
It does not require internal financials – so it’s fully transparent internally and externally.
The metric is normalized, so you can compare companies’ STCs for benchmarking purposes.
This metric updates often (as salespeople tend to update their profiles quickly).
It’s also a more forward-looking indicator than revenue (you need salespeople first before closing more deals).

This metric can also be adapted to different roles outside of sales by simply changing the title query (could tailor to executive management roles with search tokens like “Chief”, “VP”, etc.).

VPs of Sales, CEOs, VCs, VPs of FP&A, job candidates, hedge fund quants, etc. should be leveraging STC-like metrics for planning models, researching competitive landscapes and evaluating investments or job opportunities.

Special thanks to the following for reviewing drafts of this piece:

Ajay Agarwal, George Bischof, Matt Cooley, David Gilmour, Amar Goel, Naren Gupta, Nathan Latka, Nick Mehta, David Kellogg, Vish Makhijani, Tomasz Tunguz and Jeff Weiner.

***

Do note, some technical caveats regarding this approach for computing STC:

It can be difficult to be precise with LinkedIn’s search. Three examples: (1) the company name may overlap with other companies with similar names; (2) you can’t search past titles (at least via the free account); (3) you have to negate the company name with ‘-’ in order to find people who worked at a company in the past and are no longer there.

If a company is fresh and hasn’t had any churn (or really low churn), then this metric is meaningless (it has the best STC score). It just means they haven’t had enough sales team churn yet. It’s still useful to look at comparable companies that have been in-market longer and use their STC scores for headcount planning / forecasting.

Different companies use different sales titles. May need to adjust the title query on a per company basis.

The sales titles query does not exclusively correspond to quota carrying reps.

Not everyone is on LinkedIn. This is usually not a problem for sales reps as they typically want to advertise themselves in order to be able to connect with potential customers – esp. at tech companies. Even when not all sales reps are accounted for (when a rep is not on LinkedIn or the title query doesn’t capture that person), I find that this metric is still directionally useful especially on a relative basis with other companies (it’s consistently sampling each company in the same manner).

What I learned working under Turing Award winner Jim Gray – 10 years since his disappearance

A few days ago, I sent out the following email remembering Jim to close friends and colleagues. I did not intend to share this broadly, but I received many positive replies encouraging me to post this publicly … so, here it is:

10 years ago this month, my mentor and idol Jim Gray disappeared at sea. I had the greatest fortune to work under him. We had published a paper together in the weeks leading up to his final sail.

I learned so much from Jim, and I think about him a lot. We even incorporated our company name after his saying “party on the data” (Party On Data, Inc.). To this day, I continue to unpack and internalize the lessons that I absorbed while working with him more than a decade ago.

I learned it’s important to make time for the unexpected. It felt like nearly everyone I knew in my circle had talked to Jim at some point – even people from very different fields of study. I am not sure how he was able to be so generous with his time given his position and stature, but if someone reached out to him with an interesting hook and was passionate about it, he made time. And, it wasn’t just a meet and greet – he truly listened. He would be engrossed in the conversation, and listen intently like you were the professor and he was the student. He made you feel special; that you had some unique insight about a very important problem area.

Jim’s projects were proof that making time and having an open mind for the unexpected – to converse and collaborate with people beyond your direct connections – can lead to breakthroughs in other disciplines. He made significant contributions to the SkyServer project, which helped astronomers federate multiple terabytes of images to serve as a virtual observatory (a world-wide telescope). He applied a similar approach to mapping data with the Virtual Earth project (the precursor to Google Maps – minus the AJAX).

In today’s world, with so many distractions and communication channels (many of which are being inundated with spam), it has become commonplace to ignore cold inbound requests. However, I learned from Jim that it’s crucial to make time for surprises, and to give back. No other Turing Award winner responded to my emails and calls – only Jim did – and by doing so, he completely changed my life for the better. Jim instilled confidence in me that I mattered in this world, if someone important like him was willing to invest his precious time with me.

I learned from Jim that it’s important to tackle very good and crisp problems – and to work diligently on them (and to write everything down). Jim had a knack for identifying great problems. Comb through his website – it’s hard to find a dud in his resume or project list. I remember we were in talks with a major hospital about an ambitious project to improve the detection of diseases. The hospital group was willing to support this high profile project in any way we needed (thanks to Jim being a rock star), but Jim immediately knew we wouldn’t be able to develop crisp, tangible results within a year. He wanted more control, and craved a project with more short-term wins.

When Jim did identify a crisp problem to work on, he went all-in. His work ethic was second to none. We were once at a baseball game together, and I could tell from his demeanor that he was itching to get back to the office to continue our work. If I was working late in the office, he would work late too. He remained technical (writing code right next to me) and deep in the weeds despite his senior management role. He was responsive. Late night emails to emails at 5 AM (he liked waking up with the birds). He pushed me to work harder – not by asking for it, but by leading by example.

With any project, but especially database projects, there are so many low-level, unsexy problems (like data cleaning) that have to be addressed before you can “party on the data.” “99% perspiration and 1% inspiration,” he would always say, like it was a constant, inevitable force of nature that we have to equip ourselves for. He prepared me for that, which taught me how to stay focused and work harder.

I learned that it’s important to learn about key inflection points from previous products and projects – to know your history in order to make better decisions. Jim was a master storyteller – constantly reciting history. I still remember his story about how Sybase was outgunned in the database market, but their innovation with stored procedures gave them the differentiation they needed to fight the fight with DB2 and Oracle. And, by the way, he was very laudatory of key features coming from competitors. He would never dismiss them – he loved the innovation, no matter where it came from. He wanted the truth for how to best solve a particular problem.

He loved to teach his lessons too. I recall one time I asked him a technical question, and an important call came through to his desk phone. He immediately hung up the call and took me to the whiteboard to teach me what he knew about the topic in question. Who does that? You’d be lucky to meet with your thesis advisor or manager once a week for 30 minutes, but Jim was present for me like this almost every day.

Jim set the highest management bar imaginable for me. He showed me why I should optimize 100% for mentorship throughout my career – not company brand – and to do this every time.

I sometimes wish he could see me now, as I feel like I wasn’t able to show him everything that I could do then, as I was still in the infancy of my career. I know better now where I excel (and where I don’t). At the time, I wanted to learn and do it all, like there was no tomorrow. He encouraged me to follow my passions – even if they were outside his comfort zone. Jim had no ego – he would loop in another mentor who knew more about a particular subject area. He gave me rope to learn, fail and rebuild. I tried to savor every minute I had with Jim, and am thankful that I did.

Despite his amazing technical accomplishments, I honestly do not remember many of the technical concepts that he had taught me. What I remember is how he made me feel. That’s what lives on and matters most. He gave me confidence, by just responding to me, and of course, working side by side with me. He rewarded my proactive outreach (which certainly encouraged me to send many more cold emails thereafter), and most importantly, taught me how to approach and solve big problems.

Jim truly inspires me, and I am forever grateful for what he did for me and my career. I sincerely hope that one day, I too, can have such a profound positive influence on so many people’s lives.

To being tenacious like Jim.

Ranking Companies on Sales Culture & Retention

A company’s sales retention rate is a very important indicator of business health. If you have a good gauge on this, you could better answer questions such as: should I join that company’s sales department, will I be able to progress up the ladder, are reps hitting their numbers, are they providing effective training, should I invest money in this business, etc. But how does one measure this rate especially from an outside vantage point? This is where LinkedIn comes to the rescue. I essentially cross applied the approach I took to measuring engineering retention to sales.

This chart ranks several key technology companies in reverse order of sales churn – the higher on the chart (or the longer the bar), the higher the churn (so from worst at the top to best at the bottom).

So how are we defining sales churn here? I calculated the measurement as follows: I took the number of people who have ever churned in a sales role from the company and divided that by the number of days since incorporation for that respective company (call this Churn Per Day). Then I computed the ratio of how many salespeople will churn in one year (the run rate, i.e. Churn Per Day * 365) over the number of current salespeople employed.

For example, if you look at the top row, which is Zenefits, the value is 0.40 – which means that 40% of the current sales team size will churn in a one-year period. In order to maintain that sales team size and corresponding revenue, the company will need to hire the equivalent of 40% of its team – and sooner than in a year, both because that churn is likely spread throughout the year and because new sales hires take time to ramp (if you’re churning a ramped rep and, say, it takes one quarter to ramp a new sales rep, then you need to hire a new head at least one quarter beforehand to avoid a revenue dip).
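As a rough sketch of the calculation described above: the function name, its parameters and the sample figures below are all hypothetical and are not data for any company in the chart.

```python
from datetime import date


def annual_sales_churn_ratio(past_sales_heads, current_sales_heads,
                             incorporated, as_of):
    """Share of the current sales team expected to churn in one year."""
    days = (as_of - incorporated).days
    if days <= 0 or current_sales_heads <= 0:
        raise ValueError("need a positive company age and current headcount")
    # Churn Per Day = people who ever churned from sales / days since incorporation
    churn_per_day = past_sales_heads / days
    # One-year run rate (Churn Per Day * 365) over current sales headcount
    return churn_per_day * 365 / current_sales_heads


# Made-up inputs purely for illustration
ratio = annual_sales_churn_ratio(
    past_sales_heads=200,
    current_sales_heads=500,
    incorporated=date(2013, 1, 1),
    as_of=date(2016, 1, 1),
)
# Number of hires needed over the year just to hold the team size flat
replacement_hires = ratio * 500
print(f"{ratio:.1%} of the team, about {replacement_hires:.0f} hires per year")
```

A value of 0.40 from this function would correspond to the Zenefits row described above: 40% of the current team size churning in one year.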

A few more notes:

The color saturation indicates Churn Per Day – the darker the color, the higher the Churn Per Day.

Caveats listed in the previous post on engineering retention apply to this analysis too.

Top Tech Companies Ranked By Engineering Retention

(TL;DR) Here’s the ranking going from top to bottom (so higher / longer the better):

How did you measure this?

By running advanced LinkedIn searches and counting up the hits. Specifically, for each company, at their headquarters location only, I searched for profiles that were or are software engineers and had at least 1 year of experience. Then I filtered these results in two ways:

1) Counting how many of those profiles used to work at the company in question (and not currently). Call this result Past Not Current Count.

2) Separately (not applying the above filter), filtering to those who currently work at the company and have been there for at least 1 year. Call this Current Count.

I also computed the number of days since incorporation for each respective company to be able to compute Churn Per Day – which is simply dividing Past Not Current Count by the number of days since incorporation.

Then I took this rate, and computed how long in years it would take for each company to churn through all of their Current Count or current heads who were or are software engineers and who’ve been with the company for at least 1 year (those who possess the most tribal wisdom and arguably deserve more retention benefits). Call this the Wipeout Period (in years) figure. This is what’s plotted in the chart above and is represented by the size of the bars – so longer the better for a company.
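The Wipeout Period described above can be sketched as follows. The function and parameter names are hypothetical, and the inputs in the example are made up rather than taken from any company in the chart.

```python
def wipeout_period_years(past_not_current_count, current_count,
                         days_since_incorporation):
    """Years it would take to churn through all current 1+ year engineers."""
    if days_since_incorporation <= 0:
        raise ValueError("days_since_incorporation must be positive")
    # Churn Per Day = Past Not Current Count / days since incorporation
    churn_per_day = past_not_current_count / days_since_incorporation
    if churn_per_day == 0:
        # No observed churn: the current heads are never wiped out at this rate
        return float("inf")
    # Convert the daily rate to a yearly one, then see how long Current Count lasts
    return current_count / (churn_per_day * 365)


# Made-up inputs: 1,000 past engineers over ~10 years, 2,000 current ones
years = wipeout_period_years(past_not_current_count=1000,
                             current_count=2000,
                             days_since_incorporation=3650)
print(f"Wipeout period: {years:.1f} years")
```

With these inputs the company loses about 100 engineers a year, so 2,000 current heads last about 20 years.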

What does the color hue indicate?

The Churn Per Day (described in the previous answer). The darker the color the higher the churn rate.

Who’s safe and who’s at risk?

I would think a wipeout period under 10 years (esp. if you’re a larger and mature company) would be very scary.

In general (disclaimer – subjective – would like to run this over more comps) greater than 20 years feels safe, but if you’re dark green (and hence experience more churn per day) then in order to keep your wipeout period long you need to be hiring many new engineering heads constantly (but you may not always be hot in tech to be able to maintain such a hiring pace!).

What are the caveats with this analysis?

There are several, but to mention a few:

Past Not Current Count biases against older companies – for ex. Microsoft has had more churn than # of present heads because they’ve been in business for a long time.

I needed more precise filtering options than what was available from LinkedIn to be able to properly remove software internships (although one could argue that’s still valid churn – it means that the company wasn’t able to pipeline them into another internship or full-time position) as well as to ensure that the Past Not Current Count factored in only those who were software engineers at the time that they were working at that company. So, given the lack of these filters, a better description for the above chart would be Ranking Retention of Folks with Software Experience.

Also, this analysis assumes the Churn Per Day figure is the same for all folks currently 1+ years at their respective company, even though it’s likely that the churn rate differs depending on the # of years you’ve been at the company (I’m essentially assuming it’s a wash – that the distributions of the historical Past Not Current vs Current are similar).

Ranking High Schools Based On Outcomes

High school is arguably the most important phase of your education. Some families will move just to be in the district of the best ranked high school in the area. However, the factors that these rankings are based on, such as test scores, tuition amount, average class size, teacher to student ratio, location, etc. do not measure key outcomes such as what colleges or jobs the students get into.

Unfortunately, measuring outcomes is tough – there’s no data source that I know of that describes how all past high school students ended up. However, I thought it would be a fun experiment to approximate using LinkedIn data. I took eight top high schools in the Bay Area (see the table below) and ran a whole bunch of advanced LinkedIn search queries to find graduates from these high schools while also counting up their key outcomes like what colleges they graduated from, what companies they went on to work for, what industries are they in, what job titles have they earned, etc.

The results are quite interesting. Here are a few statistics:

College Statistics

The top 5 high schools that have the largest share of users going to top private schools (Ivy League’s + Stanford + Caltech + MIT) are (1) Harker (2) Gunn (3) Saratoga (4) Lynbrook (5) Bellarmine.
The top 5 high schools that have the largest share of users going to the top 3 UC’s (Berkeley, LA, San Diego) are (1) Mission (2) Gunn (3) Saratoga (4) Lynbrook (5) Leland.
Although Harker has the highest share of users going to top privates (30%), their share of users going to the top UC’s is below average. It’s worth noting that Harker’s tuition is the highest at $36K a year.
Bellarmine, an all men’s high school with tuition of $15K a year, is below average in its share of users going on to top private universities as well as to the UC system.
Gunn has the highest share of users (11%) going on to Stanford. That’s more than 2x the second place high school (Harker).
Mission has the highest share of users (31%) going to the top 3 UC’s and to UC Berkeley alone (14%).

Career Statistics

In rank order (1) Saratoga (2) Bellarmine (3) Leland have the biggest share of users who hold job titles that allude to leadership positions (CEO, VP, Manager, etc.).
The highest share of lawyers come from (1) Bellarmine (2) Lynbrook (3) Leland. Gunn has 0 lawyers and Harker is second lowest at 6%.
Saratoga has the best overall balance of users in each industry (median share of users).
Hardware is fading – 5 schools (Leland, Gunn, Harker, Mission, Lynbrook) have zero users in this industry.
Harker has the highest share of its users in the Internet, Financial, and Medical industries.
Harker has the lowest percentage of Engineers and below average share of users in the Software industry.
Gunn has the highest share of users in the Software and Media industries.
Harker high school is relatively new (formed in 1998), so its graduates are still early in the workforce. Leadership takes time to earn, so the leadership statistic is unfairly biased against Harker.

You can see all the stats I collected in the table below. Keep in mind that percentages correspond to the share of users from the high school that match that column’s criteria. Yellow highlights correspond to the best score; blue shaded boxes correspond to scores that are above average. There are quite a few caveats which I’ll note in more detail later, so take these results with a grain of salt. However, as someone who grew up in the Bay Area his whole life, I will say that many of these results make sense to me.

A Comparison of Open Source Search Engines

Updated: sphinx setup wasn’t exactly ‘out of the box’. Sphinx searches the fastest now and its relevancy increased (charts updated below).

Motivation

Later this month we will be presenting a half day tutorial on Open Search at SIGIR. It’ll basically focus on how to use open source software and cloud services for building and quickly prototyping advanced search applications. Open Search isn’t just about building a Google-like search box on a free technology stack, but about encouraging the community to extend and embrace search technology to improve the relevance of any application.

For example, one non-search application of BOSS leveraged the Spelling service to spell correct video comments before handing them off to their Spam filter. The Spelling correction process normalizes popular words that spammers intentionally misspell to get around spam models that rely on term statistics, and thus, can increase spam detection accuracy.

We have split up our upcoming talk into two sections:

Services: Open Search Web APIs (Yahoo! BOSS, Twitter, Bing, and Google AJAX Search), interesting mashup examples, ranking models and academic research that leverage or could benefit from such services.

Software: How to use popular open source packages for vertical indexing your own data.

While researching for the Software section, I was quite surprised by the number of open source vertical search solutions I found:

Lucene (Nutch, Solr, Hounder), Sphinx, zettair, Terrier, Galago, Minnion, MG4J, Wumpus, RDBMS (mysql, sqlite), Indri, Xapian, grep …

And I was even more surprised by the lack of comparisons between these solutions. Many of these platforms advertise their performance benchmarks, but they are in isolation, use different data sets, and seem to be more focused on speed as opposed to say relevance.

The best paper I could find that compared performance and relevance of many open source search engines was Middleton+Baeza’07, but the paper is quite old now and didn’t make its source code and data sets publicly available.

So, I developed a couple of fun, off-the-wall experiments to test some of the popular vertical indexing solutions. (These are for building code examples – just a simple/quick evaluation and not for SIGIR – read the disclaimer in the conclusion section.) Here’s a table of the platforms I selected to study, with some high level feature breakdowns:

High level feature comparison among the vertical search solutions I studied. The support rating and scale are based on information I collected from web sites and conversations (please feel free to comment). I tested each solution's latest stable release as of this week (Indri is TODO).

One key design decision I made was not to change any numerical tuning parameters. I really wanted to test “Out of the Box” performance to simulate the common developer scenario. Plus, it takes forever to optimize parameters fairly across multiple platforms and different data sets esp. for an over-the-weekend benchmark (see disclaimer in the Conclusion section).

Also, I tried my best to write each experiment natively for each platform using the expected library routines or binary commands.

Twitter Experiment

For the first experiment, I wanted to see how well these platforms index Twitter data. Twitter is becoming very mainstream, and its real time nature and brevity differs greatly from traditional web content (which these search platforms are overall more tailored for) so its data should make for some interesting experiments.

So I proceeded to crawl Twitter to generate a sample data set. After about a full day and night, I had downloaded ~1M tweets (~10/second).

But before indexing, I did some quick analysis of my acquired Twitter data set:

# of Tweets: 968,937

Indexable Text Size (user, name, text message): 92MB

Average Tweet Size: 12 words

Types of Tweets based on simple word filters:

Out of a 1M sample, what types of Tweets do we find? Unique Users means that there were ~600k users that authored all of the 1M tweets in this sample.

Very interesting stats here – especially the high percentage of tweets that seem to be asking questions. Could Twitter (or an application) better serve this need?

Here’s a table comparing the indexing performance over this Twitter data set across the select vertical search solutions:

Indexing 1M twitter messages on a variety of open source search solutions; measuring time and space for each.

Lucene was the only solution that produced an index that was smaller than the input data size. It shaves off an additional 5 megabytes if one runs it in optimize mode, but at the cost of adding another ten seconds to indexing.

sphinx and zettair index the fastest. Interestingly, I ran zettair in big-and-fast mode (which sucks up 300+ megabytes of RAM) but it ran slower by 3 seconds (maybe because of the nature of tweets).

Xapian ran 5x slower than sqlite (which stores the raw input data in addition to the index) and produced the largest index file sizes. The default index_text method in Xapian stores positional information, which blew the index size up to 529 megabytes. One must use index_text_without_positions to make the size more reasonable. I checked my Xapian code against the examples and documentation to see if I was doing something wrong, but I couldn’t find any discrepancies.

I also included a column about development issues I encountered. zettair was by far the easiest to use (simple command line) but required transforming the input data into a new format. I had some text issues with sqlite (which also needs to be recompiled with FTS3 enabled) and sphinx given their strict input constraints. sphinx also requires a conf file, and it took some searching to find full examples of one. Lucene, zettair, and Xapian were the most forgiving when it came to accepting text inputs (zero errors).

Measuring Relevancy: Medical Data Set

While this is a fun performance experiment for indexing short text, this test does not measure search performance and relevancy.

To measure relevancy, we need judgment data that tells us how relevant a document result is to a query. The best data set I could find that was publicly available for download (almost all of them require mailing in CD’s) was from the TREC-9 Filtering track, which provides a collection of 196,403 medical journal references – totaling ~300MB of indexable text (titles, authors, abstracts, keywords) with an average of 215 tokens per record. More importantly, this data set provides judgment data for 63 query-like tasks in the form of “<task, document, 2|1|0 rating>” (2 is very relevant, 1 is somewhat relevant, 0 is not rated). An example task is “37 yr old man with sickle cell disease.” To turn this into a search benchmark, I treat these tasks as OR’ed queries. To measure relevancy, I compute the Average DCG across the 63 queries for results in positions 1-10.
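The relevancy measure can be sketched as below. The post does not say which DCG variant was used, so this assumes one common form in which the rating at position i is divided by log2(i + 1); the function names and the dictionary layouts for results and judgments are hypothetical.

```python
import math


def dcg_at_k(ratings, k=10):
    """DCG over the first k ratings, assuming the rel / log2(position + 1) form."""
    # enumerate() is 0-based, so position i + 1 is discounted by log2(i + 2)
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings[:k]))


def average_dcg(ranked_results, judgments, k=10):
    """Mean DCG@k across tasks.

    ranked_results: {task_id: [doc_id, ...]} in the order an engine returned them
    judgments: {(task_id, doc_id): rating}, rating 2 or 1; unlisted pairs count as 0
    """
    scores = []
    for task_id, doc_ids in ranked_results.items():
        ratings = [judgments.get((task_id, doc_id), 0) for doc_id in doc_ids]
        scores.append(dcg_at_k(ratings, k))
    return sum(scores) / len(scores) if scores else 0.0


# Tiny made-up example: two tasks, three judged documents
judgments = {("t1", "d1"): 2, ("t1", "d2"): 1, ("t2", "d9"): 2}
ranked_results = {"t1": ["d1", "d2", "d3"], "t2": ["d7", "d9"]}
print(average_dcg(ranked_results, judgments))
```

Because unjudged documents count as 0, an engine is only rewarded for placing rated documents near the top of its first ten results.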

Performance and Relevancy marks on the TREC-9 (OHSUMED) data set across select vertical search solutions. Lucene is the smallest, most relevant and fastest to search; Xapian is very close to Lucene on the search side but 3x slower on indexing and 4x bigger in index space; zettair is the fastest indexer.

With this larger data set (3x larger than the Twitter one), we see zettair’s indexing performance improve (which makes sense, as it’s designed more for larger corpora); zettair’s search speed should probably be a bit faster because its search command line utility prints some unnecessary stats. For multi-searching in sphinx, I developed a Java client (with the hopes of making it competitive with Lucene – the one to beat) which connects to the sphinx searchd server via a socket (that’s their API model in the examples). sphinx returned searches the fastest – ~3x faster than Lucene. Its indexing time was also on par with zettair. Lucene obtained the highest relevance and smallest index size. Its index time could probably be improved by fiddling with its merge parameters, but I wanted to avoid numerical adjustments in this evaluation. Xapian has very similar search performance to Lucene but with significant indexing costs (both time and space > 3x). sqlite has the worst relevance because it doesn’t sort by relevance, nor does it seem to provide an ORDER BY function to do so.

Conclusion & Downloads

Based on these preliminary results and anecdotal information I’ve collected from the web and people in the field (with more emphasis on the latter), I would probably recommend Lucene (which is an IR library – use a wrapper platform like Solr w/ Nutch if you need all the search dressings like snippets, crawlers, servlets) for many vertical search indexing applications – especially if you need something that runs decently well out of the box (as that’s what I’m mainly evaluating here) and has community support.

Keep in mind that these experiments are still very early (done on a weekend budget) and can/should be improved greatly with bigger and better data sets, tuned implementations, and community support (I’d be the first one to say these are far from perfect, so I open sourced my code below). It’s pretty hard to make a benchmark that everybody likes (especially in this space where there haven’t really been many … and I’m starting to see why :)), not necessarily because there are always winners/losers and biases in benchmarks, but because there are so many different types of data sets and platform APIs and tuning parameters (at least databases support SQL!). This is just a start. I see this as a very evolutionary project that requires community support to get it right. Take the results here for what they’re worth and still run your own tuned benchmarks.

To encourage further search development and benchmarks, I’ve open sourced all the code here:

http://github.com/zooie/opensearch/tree/master

Happy to post any new and interesting results.

Surviving a Lunch Interview

I always found lunch interviews to be the most frustrating experiences ever. There you are, given an opportunity to pig out in a grand cafeteria on corporate expense – so naturally, you stock up the tray to get your chow down. You sit down at the table across from your interviewer, and right as you’re about to take that first scrumptious bite, your interviewer asks you a question. You of course answer it completely, but before returning to the meal you’re asked a follow-up question, and then another one, and before you know it rapid-fire Q&A begins. You do your best to answer each one … as your food gets cold … as your stomach growls … and as you watch the interviewer nodding to your comments with his/her mouth filled with that savory steak and potatoes you’re dying to devour. Why can’t the interviewer just go to the bathroom or receive a cell call already?!

This isn’t the interviewer’s fault by any means. After all, it is an interview, and their role compels constant question asking (silence is awkward). Additionally, this whole food tease leading to short-term starvation isn’t the worst consequence. You can get food stuck in your teeth, pass gas, get bad breath, spill your food and drink all over your interviewer, etc. It’s probably the most dangerous, error-prone part of the interview process (actually probably not … since you typically don’t get asked technically involved questions over food).

So here’s some advice to those who find themselves in similar situations. Sadly, it took me nearly three years of lunch interviews to discover these pointers:

Eat a big breakfast. Lunch should be a snack.
When you do eat lunch, order the soup with bread. It warms your body and soothes your throat. It’s simple to eat. Nothing gets stuck in your teeth. Your hands stay clean for that final handshake, so there’s no need to wash them. It’s not greasy (like pizza), so it doesn’t reflect bad diet habits to your interviewer. Also, the bread soaks up the soup, making the meal filling and giving you additional energy for the rest of the day.
Eat slowly, since your interviewer probably got more food than you. You don’t want to finish earlier than him/her. It tends to rush the other person. Your goal is to make the lunch round long and fun. Keep the conversation going, but don’t overdo it to the point where the interviewer starts to zone out. Ask questions when the interviewer runs out of questions (this also gives you more time to eat!). Make the most of lunch to learn as much as you can about the group. Their insight will be super useful in the upcoming rounds. Just think of lunch as a break before the more technical rounds.
Drink water. It really is the best drink ever. No chance of an upset stomach during or after the round. If you’re starving and know the soup + bread won’t fill you up (eating slowly helps fill you up though), get an Odwalla. It’s seriously a second meal.
Don’t take notes. That’s too much IMHO. Keep it informal, unless the interviewer specifies otherwise.

That’s all I got. Nothing crazy.

Anyways, hope these pointers come in handy.

Vik's Blog

Posts that pay homage to Jim Gray's "Let's party on the data" line.

Job Stuff