1Point3Acres Forum (一亩三分地论坛)

[DataScience] The Data Revolution GUEST POST BY LEO POLOVETS

小K posted on 2014-1-25 09:57:37

http://blog.relateiq.com/the-data-revolution/

Leo Polovets has an impressive 10-year background as an engineer at LinkedIn, Google, and Factual. You can now find him as a partner at Susa Ventures. Below is a post from his blog detailing a tech talk given at RelateIQ by DJ Patil. For similar posts, make sure to check out Leo’s blog, Coding VC.

Yesterday I went to a great talk by DJ Patil called “Building Great Data Products.” DJ has an impressive background in building data-centric products: he was head of LinkedIn’s data products for several years, then Data Scientist in Residence at Greylock, and is now the VP of Product at RelateIQ, a CRM tool whose homepage reads, “The Beginning of Data Science in Decision Making.” He also co-coined the term Data Scientist.

DJ discussed some of the lessons he learned while building products at LinkedIn and RelateIQ, and the following is a summary of my notes from the talk:

  • Don’t try to be too clever. Simple, straightforward approaches beat cleverness 9 times out of 10.
  • Start with something simple, then make it more complex if necessary. Don’t start with something complex and then simplify.
  • The hardest part of data science is getting good, clean data. Cleaning data is often 80% of the work.
  • Try to get clean data from the front end (i.e. the user) instead of cleaning it on the backend. For example, if you’re trying to figure out what company someone works for, it’s easier to guide them with auto-complete or “did you mean ___?” suggestions, rather than accepting whatever they type and trying to understand it later. You’d be surprised at the number of ways in which people can input the same thing if you don’t give them any guidance.
  • Use humans in general and Mechanical Turk specifically for early versions of your product, then try to automate and streamline as desired.
  • Build easy products first. For example, start with collaborative filtering before diving into fully personalized recommendations.
  • Showing users their own data via charts, blog posts, etc. is a great way to engage them.
  • When showing data, think about 1) what you want the viewer to take away, 2) what actions you want them to take, and 3) how you want them to feel. UX is very important. Don’t overload people with too much information or creep them out with inappropriate details.
  • Set user expectations low. If you set high expectations and screw up, it’s very hard to regain a user’s trust. For example, if you tell someone, “We know you will love XYZ!” and they don’t like XYZ, they’ll be skeptical of your future recommendations — or even ignore them. If you reframe as, “Are you interested in XYZ? No? Okay, sorry!” then users will be more forgiving.
  • Unfortunately, the best way to test data products is in production. It’s the only way to find out if your recommendations are effective and to learn about all of the warts and corner cases that lead to embarrassing mistakes. For example, how do you tell if your product suggestions are good? Show them to users and measure the effect that they have on spending/engagement/whatever you’re hoping to improve.
  • Simple beats clever 9 times out of 10, but you need to be able to recognize when to build something sophisticated.
  • Try to augment humans and make them more efficient instead of trying to replace them. People generally dislike feeling unnecessary or replaceable.
  • Minimize the friction in your product. If you’re asking users to answer questions or input data, make that as easy and painless as possible — otherwise users won’t do it. Nobody reads manuals and instructions anymore. Strive to make products that are as intuitive as the iPad or Angry Birds.
  • Rule of thumb: every time you ask for data, your conversion funnel takes a 10% hit. Try to keep all questions lightweight and easy to answer so that you can minimize the damage.
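The tip above about collecting clean data at the front end (guiding users with “did you mean ___?” suggestions) can be sketched in a few lines. This is only an illustration and not something shown in the talk: the company list, the function name `suggest_company`, and the similarity cutoff are all assumptions, and it uses the fuzzy matcher from Python's standard `difflib` module.

```python
import difflib

# Hypothetical list of canonical company names; in a real product this
# would come from your own database.
KNOWN_COMPANIES = ["Google", "LinkedIn", "Factual", "RelateIQ", "Greylock"]


def suggest_company(user_input, known=KNOWN_COMPANIES, cutoff=0.6):
    """Return the closest canonical name for a "did you mean ___?" prompt.

    Returns None when nothing is similar enough to suggest.
    """
    cleaned = user_input.strip().lower()
    by_lower = {name.lower(): name for name in known}
    # A case-insensitive exact match needs no fuzzy lookup.
    if cleaned in by_lower:
        return by_lower[cleaned]
    # cutoff is an assumed similarity threshold between 0 and 1.
    matches = difflib.get_close_matches(cleaned, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else None


if __name__ == "__main__":
    for typed in ["gogle", "Linkdin", "zzzz"]:
        print(typed, "->", suggest_company(typed))
```

Storing the canonical name instead of the raw input means every spelling of a company ends up as one value, so there is less to clean on the backend later.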

DJ’s talk was great and I can vouch for many of these lessons personally based on my work at Google and Factual. I think the most valuable lesson that I’ve learned over the last decade is one that came up repeatedly during the talk: simple approaches are often surprisingly effective. For example, I remember one task where I had a sparse dataset and had to fill out as much of the missing data as possible. Instead of using fancy algorithms and sophisticated machine learning, I tried the following heuristic: for every pair of columns, if a value, X, in one column was associated with a value, Y, in another column almost all of the time, then every time the first column value was X and the second column value was missing, I’d set the second column value to Y. It was a very naive approach, and yet it managed to fill in a large chunk of my dataset. I’ve now used this heuristic for data about books, movies, places of interest, and other datasets, and it often makes more clever strategies unnecessary or not worth the time.
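Here is a minimal sketch of the column-pair heuristic described above. It is not the original code: the function name, the representation of a dataset as a list of dicts with `None` for missing values, and the two thresholds (`min_confidence` standing in for “almost all of the time,” and `min_support` as a guard against rules learned from too few rows) are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def fill_by_column_pairs(rows, columns, min_confidence=0.95, min_support=3):
    """Fill missing values (None) using near-deterministic column pairs.

    For each ordered pair of columns (a, b): if value x in column a is seen
    with value y in column b in at least min_confidence of the rows where
    both are present (and there are at least min_support such rows), then
    rows with a == x and b missing get b = y.
    """
    # Learn rules from the original data only, so filled-in values are
    # never used as evidence for further fills.
    rules = {}  # (a, b, x) -> y
    for a, b in permutations(columns, 2):
        counts = defaultdict(Counter)  # x -> counts of co-occurring y values
        for row in rows:
            x, y = row.get(a), row.get(b)
            if x is not None and y is not None:
                counts[x][y] += 1
        for x, ys in counts.items():
            y, hits = ys.most_common(1)[0]
            total = sum(ys.values())
            if total >= min_support and hits / total >= min_confidence:
                rules[(a, b, x)] = y

    filled = [dict(row) for row in rows]  # leave the input untouched
    for original, new_row in zip(rows, filled):
        for a, b in permutations(columns, 2):
            x = original.get(a)
            # If several rules could fill the same cell, the first one wins.
            if x is not None and new_row.get(b) is None:
                y = rules.get((a, b, x))
                if y is not None:
                    new_row[b] = y
    return filled


if __name__ == "__main__":
    data = [
        {"city": "Paris", "country": "France"},
        {"city": "Paris", "country": "France"},
        {"city": "Paris", "country": "France"},
        {"city": "Paris", "country": None},
        {"city": "Springfield", "country": None},
    ]
    for row in fill_by_column_pairs(data, ["city", "country"]):
        print(row)
```

In the toy data, the fourth row gets “France” because “Paris” always appears with it, while “Springfield” stays empty because there is no evidence to learn a rule from.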

Another important lesson that I’ve learned is that it’s a great idea to work with small samples of data and use a single machine for as long as possible. Hadoop and distributed systems are nice when you’re running in production on terabytes of data, but they greatly diminish your development speed and ability to experiment. You’ll probably make progress much more rapidly if you just load a 500MB slice of data into RAM and experiment on your laptop.
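One way to get a laptop-sized slice out of a large file is to draw a uniform random sample in a single pass. The sketch below uses reservoir sampling, which is my own choice of technique and not something mentioned in the talk; it assumes a CSV file with a header row, and the function name `sample_rows` is made up for the example.

```python
import csv
import random


def sample_rows(path, k, seed=0):
    """Return (header, rows) with up to k rows drawn uniformly at random.

    Reads the file once and keeps only k rows in memory at a time.
    Assumes the first line of the CSV is a header.
    """
    rng = random.Random(seed)  # fixed seed so experiments are repeatable
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        reservoir = []
        for i, row in enumerate(reader):
            if i < k:
                # The first k rows fill the reservoir.
                reservoir.append(row)
            else:
                # Row i replaces a random slot with probability k / (i + 1).
                j = rng.randint(0, i)  # inclusive on both ends
                if j < k:
                    reservoir[j] = row
    return header, reservoir
```

Because only `k` rows are ever held in memory, the same function works whether the source file is 50MB or 50GB, and the resulting sample can then be explored on a single machine.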

After DJ’s talk, I started thinking about the many blog posts that I’ve seen that focus on technologies that are commonly used for working with data: Hadoop, SciPy, regular expressions, etc. I’d love to see more blog posts (and books) about higher-level strategies and tactics for building data products: posts that offer suggestions like “work with small samples,” “leverage Mechanical Turk,” and “start with the simplest approaches.” I might turn a few of these topics into future blog posts, but I’m sure there are many lessons that I haven’t learned yet. If you know of any great resources for creating data products, please mention them in the comment section!


pureds posted on 2014-1-27 17:44:32
thanks for sharing

anonym posted on 2014-1-28 08:51:53
K, once you make partner too, remember to give me a hand up.

小K (original poster) posted on 2014-1-29 03:57:14

FTD2014 posted on 2014-1-29 04:21:31
Quoting 小K (2014-1-29 03:57): “In the year of the monkey and the month of the horse...” (an idiom meaning “who knows when”)
The Year of the Monkey is only two years away. Go for it, K!
