What I learned working under Turing Award winner Jim Gray – 10 years since his disappearance

A few days ago, I sent out the following email remembering Jim to close friends and colleagues. I did not intend to share this broadly, but I received many positive replies encouraging me to post this publicly … so, here it is:

10 years ago this month, my mentor and idol Jim Gray disappeared at sea. I had the greatest fortune to work under him. We had published a paper together in the weeks leading up to his final sail.

I learned so much from Jim, and I think about him a lot. We even incorporated our company name after his saying “party on the data” (Party On Data, Inc.). To this day, I continue to unpack and internalize the lessons that I absorbed while working with him more than a decade ago.

I learned it’s important to make time for the unexpected. It felt like nearly everyone I knew in my circle had talked to Jim at some point – even people from very different fields of study. I am not sure how he was able to be so generous with his time given his position and stature, but if someone reached out to him with an interesting hook and was passionate about it, he made time. And, it wasn’t just a meet and greet – he truly listened. He would be engrossed in the conversation, and listen intently like you were the professor and he was the student. He made you feel special; that you had some unique insight about a very important problem area.

Jim’s projects were proof that making time and having an open mind for the unexpected – to converse and collaborate with people beyond your direct connections – can lead to breakthroughs in other disciplines. He made significant contributions to the SkyServer project, which helped astronomers federate multiple terabytes of images to serve as a virtual observatory (a world-wide telescope). He applied a similar approach to mapping data with the Virtual Earth project (the precursor to Google Maps – minus the AJAX).

In today’s world, with so many distractions and communication channels (many of which are being inundated with spam), it has become commonplace to ignore cold inbound requests. However, I learned from Jim that it’s crucial to make time for surprises, and to give back. No other Turing award winner responded to my emails and calls – only Jim did – and by doing so, he completely changed my life for the better. Jim instilled confidence in me that I mattered in this world, if someone important like him was willing to invest his precious time with me.

I learned from Jim that it’s important to tackle very good and crisp problems – and to work diligently on them (and to write everything down). Jim had a knack for identifying great problems. Comb through his website – it’s hard to find a dud in his resume or project list. I remember we were in talks with a major hospital about an ambitious project to improve the detection of diseases. The hospital group was willing to support this high profile project in any way we needed (thanks to Jim being a rock star), but Jim immediately knew we wouldn’t be able to develop crisp, tangible results within a year. He wanted more control, and craved a project with more short-term wins.

When Jim did identify a crisp problem to work on, he went all-in. His work ethic was second to none. We were once at a baseball game together, and I could tell from his demeanor that he was itching to get back to the office to continue our work. If I was working late in the office, he would work late too. He remained technical (writing code right next to me) and deep in the weeds despite his senior management role. He was responsive. Late night emails to emails at 5 AM (he liked waking up with the birds). He pushed me to work harder – not by asking for it, but by leading by example.

With any project, but especially database projects, there are so many low-level, unsexy problems (like data cleaning) that have to be addressed before you can “party on the data.” “99% perspiration and 1% inspiration,” he would always say, like it was a constant, inevitable force of nature that we have to equip ourselves for. He prepared me for that, which taught me how to stay focused and work harder.

I learned that it’s important to learn about key inflection points from previous products and projects – to know your history in order to make better decisions. Jim was a master story teller – constantly reciting history. I still remember his story about how Sybase was outgunned in the database market, but their innovation with stored procedures gave them the differentiation they needed to fight the fight with DB2 and Oracle. And, by the way, he was very laudatory of key features coming from competitors. He would never dismiss them – he loved the innovation, no matter where it came from. He wanted the truth for how to best solve a particular problem.

He loved to teach his lessons too. I recall one time I asked him a technical question, and an important call came through to his desk phone. He immediately hung up the call and took me to the whiteboard to teach me what he knew about the topic in question. Who does that? You’d be lucky to meet with your thesis advisor or manager once a week for 30 minutes, but Jim was present for me like this almost every day.

Jim set the highest management bar imaginable for me. He showed me why I should optimize 100% for mentorship throughout my career – not company brand – and to do this every time.

I sometimes wish he could see me now, as I feel like I wasn’t able to show him everything that I could do then, as I was still in the infancy of my career. I know better now where I excel (and where I don’t). At the time, I wanted to learn and do it all, like there was no tomorrow. He encouraged me to follow my passions – even if they were outside his comfort zone. Jim had no ego – he would loop in another mentor who knew more about a particular subject area. He gave me rope to learn, fail and rebuild. I tried to savor every minute I had with Jim, and am thankful that I did.

Despite his amazing technical accomplishments, I honestly do not remember many of the technical concepts that he had taught me. What I remember is how he made me feel. That’s what lives on and matters most. He gave me confidence, by just responding to me, and of course, working side by side with me. He rewarded my proactive outreach (which certainly encouraged me to send many more cold emails thereafter), and most importantly, taught me how to approach and solve big problems.

Jim truly inspires me, and I am forever grateful for what he did for me and my career. I sincerely hope that one day, I too, can have such a profound positive influence on so many people’s lives.

To being tenacious like Jim.

Ranking High Schools Based On Outcomes

High school is arguably the most important phase of your education. Some families will move just to be in the district of the best ranked high school in the area. However, the factors that these rankings are based on, such as test scores, tuition amount, average class size, teacher to student ratio, location, etc. do not measure key outcomes such as what colleges or jobs the students get into.

Unfortunately, measuring outcomes is tough – there’s no data source that I know of that describes how all past high school students ended up. However, I thought it would be a fun experiment to approximate using LinkedIn data. I took eight top high schools in the Bay Area (see the table below) and ran a whole bunch of advanced LinkedIn search queries to find graduates from these high schools while also counting up their key outcomes like what colleges they graduated from, what companies they went on to work for, what industries are they in, what job titles have they earned, etc.

The results are quite interesting. Here are a few statistics:

College Statistics

  • The top 5 high schools that have the largest share of users going to top private schools (Ivy League’s + Stanford + Caltech + MIT) are (1) Harker (2) Gunn (3) Saratoga (4) Lynbrook (5) Bellarmine.
  • The top 5 high schools that have the largest share of users going to the top 3 UC’s (Berkeley, LA, San Diego) are (1) Mission (2) Gunn (3) Saratoga (4) Lynbrook (5) Leland.
  • Although Harker has the highest share of users going to top privates (30%), their share of users going to the top UC’s is below average. It’s worth nothing that Harker’s tuition is the highest at $36K a year.
  • Bellarmine, an all men’s high school with tuition of $15K a year, is below average in its share of users going on to top private universities as well as to the UC system.
  • Gunn has the highest share of users (11%) going on to Stanford. That’s more than 2x the second place high school (Harker).
  • Mission has the highest share of users (31%) going to the top 3 UC’s and to UC Berkeley alone (14%).

Career Statistics

  • In rank order (1) Saratoga (2) Bellarmine (3) Leland have the biggest share of users which hold job titles that allude to leadership positions (CEO, VP, Manager, etc.).
  • The highest share of lawyers come from (1) Bellarmine (2) Lynbrook (3) Leland. Gunn has 0 lawyers and Harker is second lowest at 6%.
  • Saratoga has the best overall balance of users in each industry (median share of users).
  • Hardware is fading – 5 schools (Leland, Gunn,  Harker, Mission, Lynbrook) have zero users in this industry.
  • Harker has the highest share of its users in the Internet, Financial, and Medical industries.
  • Harker has the lowest percentage of Engineers and below average share of users in the Software industry.
  • Gunn has the highest share of users in the Software and Media industries.
  • Harker high school is relatively new (formed in 1998), so its graduates are still early in the workforce. Leadership takes time to earn, so the leadership statistic is unfairly biased against Harker.

You can see all the stats I collected in the table below. Keep in mind that percentages correspond to the share of users from the high school that match that column’s criteria. Yellow highlights correspond to the best score; blue shaded boxes correspond to scores that are above average. There are quite a few caveats which I’ll note in more detail later, so take these results with a grain of salt. However, as someone who grew up in the Bay Area his whole life, I will say that many of these results make sense to me.

SDSS Skyserver Traffic

This past summer I worked at MSR alongside Dr. Jim Gray on analyzing the Skyserver’s (the online worldwide telescope portal) web and SQL logs. We just published our findings, which you can access here (MSR) or here (updated).

Still needs some clean-up (spelling, grammar, flow) and additional sections to tie up some loose ends, but it’s definitely presentable. Would love to hear what you guys think about the results (besides how pretty the graphs look :).

Are professors too paper-happy?

[Update (4/26/2006): Motivation]

I've received several emails/comments (mostly from researchers and professors) regarding this post and realized some may have misconstrued my intentions for writing this article. I don't blame them. The title sounds pretty controversial and this post is quite long – compelling readers to skim and consequently miss some of the points I'm trying to make. As I mention in the second paragraph, 'I'm a huge fan of publications. The more the better.' All I'm trying to do here is make the case for why I think researchers should ALSO use more informal avenues of communication, such as blogs, for getting their ideas out for public review/commenting. These sources would not only increase reach/access, but serve as great supplementary material to their corresponding conference papers – NOT as replacements or alternatives to publications. I received a very insightful idea from Hari Balakrishnan – What if authors, in conjunction with their papers, release informal 5-6 page summaries of their projects/ideas for online public review/commenting? I would love to have resources like these available to me when dissecting the insight/knowledge encapsulated in formalized papers. I think simple ideas like these would significantly improve reachability/access of our research and even encourage more creativity/questioning.

[End Update]

On Digg:


I recently read through a professor's CV, which under the 'Publications' section stated "… published over 100 conference papers." I then proceeded to the next page of the resume to scan through some of his paper titles, only to see a brand new section titled 'References'. Wait, where'd his publications go? I went back to see if I missed a page. Nope. Wow … why wouldn't he include his citations or mention notable pieces of work? I mean, saying you have 100+ publications gives me no value unless you're going to list them. Granted, that's an amazing accomplishment – writing a paper that advances (cs) research is not only time consuming and hard work, but it requires a whole new level of creativity. Additionally, conferences are getting super competitive (unless he's publishing in these conferences) – I hear many cases of famous, tenured professors/researchers getting their papers rejected. Now multiplying this paper process by 100 represents a significant amount of research, so I'm willing to bet this guy is pretty bright. However, this one line publication description gives me the impression that this professor aims for quantity in order to amplify his prestige. I also get this feeling after publishing the 100th paper he reached one of those hard set goals that gives one the sense of 'Mission Accomplished'. I guess I'm different – I'm all about quality. Personally, I'd be content with 2 publications in my life if one of 'em got me the Turing Award and the other a date with Heidi Klum – but that's just me 🙂

Interestingly (or oddly), this CV publication description also got me thinking about some of the things I hate about (cs) papers. But first, let me make it very clear – I am a huge fan of conference publications. The more the better. The peer review process of research papers is absolutely critical to advancing science. In this post, however, I would like to make the case for why I think all researchers should ALSO use more informal avenues of communication (such as blogs, wikipedia, etc.) for publicly broadcasting their ideas and results of their latest research.

So let's start this off with a laundry list of gripes I have about 'most' papers (with possible alternatives/solutions mixed in):

PDF/PS/Sealed/doc/dvi/tex formats

  • That load time for the viewer is killer – probably deters many from even clicking the link
  • Certain viewers render diagams/text differently
  • The document looks formal, intimidating, elitist, old, plain, hardcore, not happy to see you
  • Also makes documents feel permanent, finalized, not amenable to changes
  • Provides no interface for commenting by readers – and in many cases I find reader critiques more interesting/useful than the actual paper
  • And why can't we see the feedback/comments from the conference?
  • Not pretty – It's a lot of material to read, so why not make the presentation look happier? I seriously think a nice stylesheet/web layout for paper content would significantly improve readability and portability

It's a lot of work

  • Not only does one need to come up with a fairly original idea, but research, discuss, and analyze its empirical/theoretical implications
  • Needs support/citations
  • Papers are quite formulaic – I can easily guess the page that has the nice/pretty graphs
  • This structure imposes pretty big research burdens on the authors
  • This is a GOOD thing for paper quality
  • But a terrible method for prototyping (similar to how UI dev's quickly prototype designs to cycle in user feedback)
  • Doesn't provide professors/researchers a forum to quickly get ideas out nor to filter in comments
  • Also prevents researchers from spewing their thoughts out since they wish to formalize everything in papers
  • Now there are other journals/magazines which are informal to let researchers publish their brainstorming/wacky idea articles
  • But there's still a time delay in getting those articles published
  • They still have writing standards and an approval process
  • And I don't read them (Where do I find good informal CS articles online? Is anyone linking to these? Who's authoring them?)
  • In blogs, authors can be comedic, speak freely ("he's full of it", "that technique is a load of crap" – opinions like these embedded in technical articles would making reading so much more enjoyable), and quickly get to the point without having to exhaust the boring implementation details


  • Although papers normally include author emails, this means feedback is kept privately in the authors' inboxes, not viewable to the general public
  • Many papers/journals require subscriptions/fees


  • The main issue here is professors/researchers want to publish in prestigious papers/journals
  • Rather than waste their time with things like weblogs, which are perceived to be inherently biased and non-factual, and where the audience may seem 'dumber'
  • It's in their best interest to focus on publications – get tenure, fame, approval from experts in the field

I want informality

  • But it sucks for us ('us' being the general audience who may wish to learn about these breakthroughs)
  • I love hearing professors ramble off their ideas freely
  • I want to see commentary from readers and myself ON these articles
  • And I want to see these ideas get posted and updated quickly
  • I want to see these experts explaining their ideas in simple terms (whenever possible, if it's too hardcore then it's probably not very good bloggable material) and describe the real world impacts/examples
  • But unfortunately NONE of my favorite professors/researchers publish their ideas on blogs

Popularity and Reach

  • There isn't much linkability happening for papers (besides from class/professor web sites)
  • You don't see conference papers getting linked on Slashdot/Digg
  • If millions of geeks don't even want to link/read this stuff, how and why would others? (refer below to the concluding section for why I think this is important)
  • Papers are formal, old-school, and designed to be read by expert audiences
  • They are written to impress, be exhaustive/detailed, when in reality most of us are lazy, want to read something entertaining, and get to the dang point
  • Wouldn't it be nice if professors/researchers expressed their ideas/research in a bloggish manner whenever possible?
  • At best a blog post would be a great supplemental/introductorial reading to the actual paper
  • Even the conference panel can check the blog comments to see if they missed anything
  • Some professors express their ideas informally with lecture notes and powerpoint presentations
  • But again, these formats don't let others annotate it with their comments
  • They are mostly pdf/ps/ppt/doc's (ah that load time)
  • And lecture notes usually exist for core concepts, not experimental/recent ideas


  • Which brings me up another interesting idea
  • Where's THE place to learn about core, fundamental cs topics?
  • Or even slightly maturing cs algorithms/techniques?
  • I tend to follow this learning process:
    • Check Wikipedia
    • At best gives me a great introduction to a topic
    • If i need more, I use what I learned from wikipedia/other sources to build queries for searching lecture notes/presentations
    • Typically the interesting details are hidden in a set of lecture notes/papers/books
    • And which of these sources should I use? – I have to read many versions coming from different professors to piece together a clear story
  • Wouldn't it be nice to have a one stop shop
  • Wikipedia!
  • What if Universities MOTIVATED professors/researchers to collaborate on wikipedia to publish in-depth articles regarding their areas of interest?
  • This would be huge
  • Wikipedia is incredibly popular/useful/successful, so let's use it as the place where professors around the world can unite their knowledge
  • Would be the best supplement (even replacement) for lecture notes
  • And for more experimental/controversial topics, researchers can use individual blogs

A slight digression and concluding remarks: The Future of Computer Science Research

Two things:

  1. I strongly believe the future of (computer) science relies on making our stuff more accessible to others – requiring us to tap formal AND informal avenues of communication
  2. We need to be more interdisciplinary!

Many people (especially prospective computer science students) ask me:

What's left to invent?

  • Most research to them (and me) sounds like 'optimizations' based work
  • Is there anything revolutionary left to do? Has it all been accomplished?
  • This is a hard question to answer convincingly, especially since if there was something revolutionary left to do, someone (hopefully me) would be on it
  • It's also difficult to forsee the impacts of ongoing research
  • Big ideas/inventions are evolutionary, piecing together decades of work.

They even say … "if only I could go back in time and invent the TV, PC, Internet, Web Search …"

  • As if back in the day we were living in the stone age and there was so many things left to do
  • There are even more things left to do today!

Our goal should not be to come up with an idea that surpasses the importance of say the invention of the PC

  • (I'm not even sure if this is possible, and at best is an ill comparison to make)
  • But rather to learn about the problems people face NOW and use our knowledge of science to solve them
  • Research should be entrepreneurial: Find a customer pain and provide a method to solve it
  • Not the other way around: Playing with super geeky stuff for fun and hoping later it might have applications (there are exceptions to this, since it encourages depth/exploration of a field which may lead to an accidental discovery of something huge, but I still think for the most part this is the wrong way to go about research)
  • Your audience is the most important element of research
  • And the beauty of our field – IT is a common theme in every research area!
  • We're the common denominator!
  • We can go about improving research in any subject

Now getting back to focus …

  • We consider the PC, Internet, etc. to be revolutionary because they are intuitive, fundamental platforms
  • Current research adds many layers of complexity to these ideas in order to solve more difficult problems
  • Making things more and more complex
  • What we need to be doing better is making our complex systems easier to use
  • We need better integration
  • And expand our research applications into other fields, rather than just reinforcing them back into computer science areas
  • We need to be interdisciplinary!
  • We have yet to even touch the surface when it comes to exploring the overlaps of fields

Some of the things we've been brewing since the 1970's could seriously REVOLUTIONIZE other industries

  • Imagine machine learning a database filled with all past medical histories/records/diagnostic reports
    • So when presented with symptoms/readings of a new patient, the system will tell you what that patient is suffering from with high probability based off analyzing millions of records
    • This would dramatically decrease the number of cases of death due to bad diagnosis and lower medical/insurance costs since it wouldn't be necessary to run a bunch of expensive scans/tests and surgeries to figure out what's wrong with a patient
  • A similar system could do the same thing over economics, disease, environmental, hazards datasets so that scientists and policy makers can ask questions like 'In the next five years what region of the world has the highest probability of a natural disaster … or a disease epidemic?, etc.'
    • This would be huge in shaping policy priorities, saving human lives, and preparing/preventing disasters like Katrina from ever escalating
  • Or what about mobile ad hoc networks to let villages in the poorest regions of the world wirelessly share an internet connection
  • Or sensor networks to do environmental sensing, health monitoring in hospitals, traffic control, pinpointing the location of a crime, etc.
  • And plenty, plenty more

The attractive elements of computer science research is not necessarily the algorithms, nor the programming implementation

  • But rather the intelligence/knowledge it can provide to humans after partying on data
  • Information is what makes a killer application!
  • And every field has tons of interesting information
  • We need to INTERACT and learn more about the needs in other fields and shape our research accordingly
  • (Unfortunately, computer scientists have a somewhat bad reputation when it comes to any form of 'interaction')

So after hearing all this, the prospective science students complain:

  • "Now we have to learn about multiple fields – our research burden is so much bigger than the scientists' in the past"
  • This is a very silly argument for defending one's laziness to pursue (computer) science
  • I mean I much rather hear one of them 'I don't want to get outsourced to India' excuses 🙂
  • Here's why:
    • We got it GOOD
    • I mean, geez, we got Google!
    • I can't even begin to imagine what Einstein, Gauss, Turing, Newton, etc. could have accomplished if they had the information research tools we got
    • We could have flying cars, cleaner environment, robots doing our homework
    • Heck, they could have probably of found a way to achieve world peace
    • We take what we have for granted
  • Back in the day, where there was no email, no search, no unified, indexed repository of publications
  • Scientists had to do amazing amounts of research just to find the right sources to learn from
  • Travel around the world to meet with leading researchers
  • Read entire books to find relevant information
  • You see, scientists back then had no choice but to be one-dimensional, and those who went deep into different fields either had no lives or were incredibly brilliant (most likely both)
  • But we can do all this work they had to do just to prepare for learning in a matter of seconds, from anywhere!
  • It's utter crap to think we can't learn about multiple fields of interest, in-depth
  • And our research tools will only get better and better exponentially faster
  • But I don't blame these students for their complaint
  • We're still trained to be one-dimensional
  • Universities have specific majors and graduate programs (although some of these programs are SLOWLY mixing together)
  • Thereby requiring many high school students to select a single major on the admissions application
  • Even though their high school (which focuses on breadth) probably did a terrible job introducing them to liberal arts and science
  • [Side Note: Actually, this is one of things I never quite understood. How do students choose a major like EE/NuclearEng/MechEng on the admissions application, when their high school most likely doesn't offer any of those courses? Sounds like their decision came down to three factors (1) The student is a geek (2) Parent's influence (3) Money. 2/3 of these are terrible reasons to enter these fields. Even worse, many universities separate engineering from liberal arts, making it almost impossible for a student who originally declared 'Economics' to switch to BioE after taking/liking a biotech class – despite the fact that their motivation to enter the major is so much better. You should enter a field you love, and making the decision on which students can enter engineering majors based off high school, which gives you almost zero exposure to these fields, makes no sense. And we wonder why we have a shortage of good engineers … well one reason is probably because we don't let them into the university programs – 'them' being the ones who actually realized they liked this stuff when they got to college.]
  • Many job descriptions want students/applicants with skills pertaining to one area

But unfortunately the problems we now face require us to understand how things mix together

  • This is why it's very important we increase the levels communications between fields to promote interdisciplinary research
  • Like a simple example: Databases and IR
    • Both have the same goal: To answer queries accurately regarding a set of data
    • Despite this, they are seen as two traditionally separate areas
    • Database/IR people normally work on different floors/buildings at companies/universities
    • One system is based traditionally more on logic while the other is more statistical/probablistic
    • But exploiting their overlap could greatly improve information access (a la SQL Server 2005)
    • Notice this example refers to two subtopics under just computer science, not even two mainstream fields!
  • Just imagine the impacts of collaborating more with other fields of study

This is why I feel blogging, wikipedia, and informal avenues of communicating our thoughts/research/ideas is very important

  • So that people in other fields can read them
  • And for the laundry list of reasons above
  • Even if it's just a proof of concept technique with no empirical studies
  • An interested geek reader who came across the blog post might just end up coding it up for you
  • Could even inspire start-up companies

So the POINT: Do whatever you can to voice your ideas/research to the largest possible audience in the hopes of cross fertilizing research in many fields.

* So what's up with the parentheses around 'cs' and 'computer' in this post? Well, many of these points generalize to all (science) research publications 🙂

This work is licensed under a Creative Commons License