For my machine learning study log, please see the separate thread.
====================
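Data science, put simply: don't draw conclusions from gut feeling; base them on data and let the facts speak.
The skill set in three words: statistics, programming, communication.
A PhD Data Scientist: Jack of all trades, master of one.
In more detail: statistics (explore data, build models, design experiments),
programming (fetch data, clean data, at least prototype your own data solution, and understand the basic working principle of big data (MapReduce)),
communication (make the complex simple, present orally, write reports and papers, make charts (static and web)).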
If your resume (and your head) contains the following, you should have little trouble finding a job:
T-test, Regression, ANOVA, Logistic Regression, DOE, Machine Learning, Data Mining, MapReduce, SQL, R/Matlab, Python, Java.
=========================================
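This post is mainly aimed at data science in the IT industry. It does not define a data engineer. Rather, it's a close call to a "full-stack data scientist". Master this list and you will not only be able to work for established firms, but startups too.
Roles that lean toward traditional industries should expect a somewhat higher bar on communication and a somewhat lower bar on the rest.
Before an interview, be sure to spend a week learning the basics of the employer's industry. Wikipedia is enough; at minimum, get familiar with the keywords commonly used in that industry.
If your goal is simply a decent job, settle down and work through the list.
If you want to do really well, stand out in at least one of the three areas.
Learning all of this takes a lot of time. If you hope to become a data scientist without much effort, OK, dream on!
Please don't treat bookmarking a study list as if it .Equals(study task already completed). Let's study together! [Strongly recommended: post your study plan so everyone can keep each other on track and discuss. The moderators will offer advice when they have time, and those who stick with it earn forum points.]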
=========================================
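I built a set of online courses to share my own experience; I hope they help.
1. Overview (a 2-hour talk)
More than 2,000 people attended the first session; I may rerun it from time to time.
2. Analytics (8-hour course)
A one-month course round.
3. Experimentation (8-hour course)
A one-month course round.
New offerings will be added based on demand, so stay tuned.
4. Other courses are still in development.
If you are interested, leave a comment describing the content you would like to see and any other suggestions.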
=========================================
If anything is unclear, please google it.
=========================================
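About a year ago the job market still looked like a mixed bag. I looked again today; probably because of year-end budget settlement, many companies have posted many new positions. A breakdown of the job requirements is here.
Data scientist/researcher positions now seem more targeted, and you can see more clearly what kind of person the employer needs: a generalist who knows a bit of everything, a programmer who knows some statistics, a machine learning person, an optimization/logistics/supply chain person, or a statistician who knows some programming.
(A data business person is generally not called a data scientist.) BI analysts who mainly use SQL to produce reports are not covered here either.
The study list is partly for interview preparation and partly what you need in day-to-day work anyway. Items I have finished are marked as green.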
=========================================
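I plan to summarize what I have learned here; additions are welcome. I will merge them into the first post from time to time.
To bookmark this thread, click "Favorite" below the first post -> Confirm -> the thread will then appear under "Quick navigation" -> Favorites.
If you have nothing specific to add, there is no need to reply. Give points if you like; it is fine if you don't.
Please don't ask me how some school's data science program is, or whether your numbers (GPA, GRE, TOEFL) can get you into some school. I have no idea.
=========================================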
Basically must-haves:
Statistics: statistics and machine learning
hypothesis testing, point/interval estimation
pvalue, power, (type 1/2 error)
CLT, delta method, derive coef and var(coef), etc.
t-test: assumptions, remedies, and the range of problems it applies to. For the basics listed above, see this course: http://onlinestatbook.com/2/index.html
glm (lm, logistic regression, anova etc): assumptions, model selection and validation, diagnostics, remedies, and the range of problems they apply to
time series: Forecast with R
Time Series Analysis and Its Applications: With R Examples (Springer Texts in Statistics)
and its Upitt course
bayesian
Bayesian for hackers (python)
Coursera Graphical Model (VERY nicely explained)
Bayesian reasoning and machine learning book (quite difficult to read)
Introductory: A first course in Bayes. A quick read, and quite good.
longitudinal, mixed model
doe:all kinds of design, response surface
(?)survival
Machine Learning Coursera Andrew Ng
Stanford Statistical Learning (Tibshirani & Hastie) -- the book also has an undergraduate-level edition that emphasizes hands-on practice, with lots of R; very easy to read. Recommend starting from here.
Caltech's Learning from Data: I was not able to keep up with it.
Please, make sure you know your logistic regression inside and out!
Deep Learning:
See my separate thread here: http://www.1point3acres.com/bbs/ ... 1&extra=#pid2601595
Learn recommender system
Learn some NLP
Make sure you KNOW how things work, not just how to call a certain package in a certain language!!!
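To make the "know how things work" point concrete, here is a minimal sketch of logistic regression fitted by plain batch gradient ascent on the log-likelihood, with no ML package involved. The toy data, the learning rate, the epoch count, and the function names are all assumptions made up for illustration; real work would normally use a well-tested library plus proper diagnostics.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by batch gradient ascent
    on the average log-likelihood (one feature, for readability)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # d(log-lik)/d(linear predictor)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Toy data (made up): simulated with intercept -1.0 and slope 2.0.
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(500)]
ys = [1 if rng.random() < sigmoid(-1.0 + 2.0 * x) else 0 for x in xs]

b0, b1 = fit_logistic(xs, ys)
print(f"intercept={b0:.2f}, slope={b1:.2f}")
```

The estimates should land near the simulated intercept and slope, up to sampling noise. Being able to write the gradient by hand is a reasonable self-check before relying on a package.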
Experimental Design / Causal Inference
This is somewhat of a niche area. But as a DS, you will most likely deal with some AB tests if you are with a reputable internet company. It is not just using some tool to compute power for a chi-square test or t-test. Be sure you know the difference between an observational study and a designed experiment, and when to use which. Students from a biostat/epi background will have an edge here. If you are able to handle very complex experiment designs, then you are opening many doors -- think multi-sided marketplaces and interfering subjects (Uber/Lyft, Airbnb, eBay), social network (Snap/FB/LinkedIn) problems, and problems where you can't cleanly randomize users (opt-in, marketing campaigns, mobile app feature rollouts).
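As a rough illustration of the power calculation mentioned above, the sketch below estimates the per-group sample size for a two-sided test comparing two conversion rates, using the common normal approximation with unpooled variances. The rates are made-up numbers and `sample_size_per_group` is a hypothetical helper name. It also assumes users can be cleanly randomized and do not interfere with each other, which is exactly what fails in the marketplace and social-network cases.

```python
import math
from statistics import NormalDist  # standard library, Python 3.8+


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed in EACH arm of a two-arm AB test
    (two-sided z-test for a difference in proportions)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for target power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: baseline conversion 10%, hoping to detect 12%.
n_large_effect = sample_size_per_group(0.10, 0.12)
# Halving the detectable lift roughly quadruples the required sample.
n_small_effect = sample_size_per_group(0.10, 0.11)
print(n_large_effect, n_small_effect)
```

The point to take away is the scaling: required sample size grows with the inverse square of the effect you want to detect, which is why small lifts need long-running tests.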
Optimization & More
Intro to linear programming https://www.math.ucla.edu/~tom/LP.pdf Good and easy read.
See see.stanford for additional courses on convex opt.
Prof. Ferguson also has some good reading material on game theory https://www.math.ucla.edu/~tom/Game_Theory/
Udacity Intro to AI is a great course (also one of the very first MOOC in this world) that connects the many concepts together, including particle filters, Kalman filter, HMM etc.
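Statistical Computing: R/Matlab/Python. SAS(?)
R and Matlab are basically regarded as equivalent in industry. However, Matlab is not free; Octave is free but not as pleasant to use. Please consider teaching yourself R. If you already know Matlab, picking up R takes no time anyway.
If you know no other language, only SAS Base/Stat, and you don't want to learn anything else, then data science may not be for you. If you absolutely must use SAS, you should at least have written macros. SAS is indeed very useful for modeling on big data, but it is quite far from what other fields use; if everyone else on the team uses R/Py/Java, communicating with them will be unusually hard. The software is also expensive, and many places may not be willing to buy it.
Note: what I am saying is that knowing SAS is a good thing, but you cannot know only SAS.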
Python: Data Analysis with Python (book), pandas
R: data.table, or plyr, lubridate, reshape2, build an R package. There are now lots of such courses on both udacity and coursera. Start from any.
know how to get data from any source (DB, web, xml, plain text, etc)
EDA (exploratory) - Descriptive stats udacity
Inference - udacity
Plot/explain
read code from your favorite packages
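On "get data from any source": a small sketch of pulling the same kind of record out of CSV, JSON, and XML text with only the Python standard library, then normalizing everything into one list of dicts. The inline sample data and the field names `user` and `score` are invented for illustration; real sources (a DB, a web API, files on disk) would replace the string constants.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Made-up sample payloads standing in for a file, an API response, an XML feed.
CSV_TEXT = "user,score\nalice,3\nbob,5\n"
JSON_TEXT = '[{"user": "carol", "score": 4}]'
XML_TEXT = "<rows><row user='dave' score='2'/></rows>"


def from_csv(text):
    """Parse CSV text with a header row into normalized records."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"user": r["user"], "score": int(r["score"])} for r in reader]


def from_json(text):
    """Parse a JSON array of objects into normalized records."""
    return [{"user": r["user"], "score": int(r["score"])} for r in json.loads(text)]


def from_xml(text):
    """Parse <row user=... score=.../> elements into normalized records."""
    root = ET.fromstring(text)
    return [
        {"user": row.get("user"), "score": int(row.get("score"))}
        for row in root.iter("row")
    ]


records = from_csv(CSV_TEXT) + from_json(JSON_TEXT) + from_xml(XML_TEXT)
total_score = sum(r["score"] for r in records)
print(records, total_score)
```

Normalizing early into one shape keeps the downstream EDA and modeling code independent of where the data came from.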
-----------------------------------------------------
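Programming: a compiled language and a scripting language
Python
I prefer Udacity's approach of teaching and quizzing as you go; with exercises only and no lectures (codecademy), I can't seem to learn things properly. Udacity CS101
Udacity CS 215 (Algorithm; easier than Coursera Princeton and Stanford, good for a quick pass)
Udacity (Peter Norvig) CS212 Design of a Computer Program: excellent, highly recommended
Java, data structures and algorithms
1. Udacity Java (this course took me 40 hours). Suitable for people who don't even know what a function or an assignment is.
2. Data structures (strongly recommended as a must-learn). Python: Problem Solving with Algorithms and Data Structures
Java: Berkeley 61B http://www.cs.berkeley.edu/~jrs/61b/
The textbooks are Head First Java & Data Structures and Algorithms in Java,
my progress bar: week 5, lab1, hw1.
3. Algorithms: Udacity Algo in Python is fairly laid back; take it if you don't want to work too hard, though it is better to be serious about this.
Java: Coursera Algo I&II (Princeton), if you are interested in the topic.
Language-agnostic: Stanford Algo I&II is also very good; the two are not substitutes for each other.
Hardly anyone learns C# as a first language, so there really aren't any good beginner books for C#; not recommended. If you have not learned C, Java, or C++ before, a C# book is nearly impossible to follow.
C++ is harder, and for a data scientist it is less widely used than Java. Of course, if you are an expert, plz pretend I said nothing.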
Design patterns, recommended by fellow forum members:
http://courses.caveofprogramming ... ns-and-architecture
https://www.youtube.com/playlist?list=PLF206E906175C7E07
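Based on interviewing candidates for my team and being interviewed elsewhere, let me quantify: what level of programming does data science actually require?
I assume you have all the other foundations above. Unless the position specifically emphasizes a statistician role, or is titled Data scientist, statistics/analytics, and the job description barely mentions coding, you can assume some coding ability is required.
The specific level is:
Data science at IT companies: you should be able to solve Leetcode Medium. So, grind problems.
Traditional companies: don't know.
If you come from a software engineering background, or the role leans toward data engineering, the bar will be higher.
Topics include, but are not limited to:
floating-point overflow
edge cases
improving MapReduce algorithms (beyond brute force)
if big data is involved, the bar on time complexity is higher: binary search, and be prepared to talk about complexity
very basic DFS/BFS
reservoir sampling
string manipulation
if DP (dynamic programming) is ever asked, it will be very basic
basic data structures
Most likely you don't need leetcode hard
-- I'll add more as I think of it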
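Two of the topics above are short enough to sketch in full. The first is reservoir sampling in its classic form (often called Algorithm R), which draws k items uniformly from a stream of unknown length in one pass and O(k) memory. The second is an iterative binary search in O(log n). The function names and the optional `rng` parameter are my own choices for illustration, not from any particular interview.

```python
import random
from typing import Iterable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Uniformly sample k items from a stream of unknown length (one pass)."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:                # happens with probability k / (i + 1)
                reservoir[j] = item
    return reservoir


def binary_search(sorted_items: Sequence[int], target: int) -> int:
    """Return the index of target in sorted_items, or -1 if absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = lo + (hi - lo) // 2    # overflow-safe form in fixed-width languages
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


sample = reservoir_sample(range(100), 3, random.Random(1))
index = binary_search([1, 3, 5, 7, 9], 7)
print(sample, index)
```

For both, be ready to discuss the edge cases: a stream shorter than k, an empty list, and a target that is not present.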
Small odds and ends to pick up along the way:
Regex (a couple of hours) http://deerchao.net/tutorials/regex/regex.htm
SQL (a week) http://www.w3schools.com/sql/ Coursera: Intro to DB
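How deep do SQL interviews go?
If the JD is DS, I have not personally seen anything truly nasty, but you definitely need to understand:
JOIN is often faster than an equivalent subquery (it depends on the engine's query planner)
COUNT DISTINCT
WHERE
GROUP BY
HAVING, and so on
advanced: you need to understand window functions
Any link you google on SQL interview prep will cover these
SQL should be the part of the interview with the least suspense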
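The constructs listed above fit into two short queries. The sketch below runs them against an in-memory SQLite database from Python, so it can be tried without any setup. The `users` and `orders` tables and their contents are invented for illustration, and the window-function query assumes the bundled SQLite is version 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (user_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
INSERT INTO users  VALUES (1, 'US'), (2, 'US'), (3, 'CN');
INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 35.0), (12, 2, 15.0), (13, 3, 50.0);
""")

# JOIN + WHERE + GROUP BY + HAVING + COUNT(DISTINCT ...):
# distinct buyers and revenue per country, keeping countries with revenue >= 50.
per_country = conn.execute("""
    SELECT u.country,
           COUNT(DISTINCT o.user_id) AS buyers,
           SUM(o.amount)             AS revenue
    FROM orders AS o
    JOIN users  AS u ON u.user_id = o.user_id
    WHERE o.amount > 10
    GROUP BY u.country
    HAVING SUM(o.amount) >= 50
    ORDER BY u.country
""").fetchall()

# Window function: rank each user's orders from largest to smallest.
ranked = conn.execute("""
    SELECT user_id, order_id, amount,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY amount DESC) AS rn
    FROM orders
    ORDER BY user_id, rn
""").fetchall()

print(per_country)
print(ranked)
```

A common follow-up is "largest order per user", which is the second query filtered to `rn = 1` in an outer query.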
Big data:
MapReduce: some knowledge. Udacity series: http://blog.udacity.com/2013/11/sebastian-thrun-launching-our-data.html
Coursera: intro to Data Science
Coursera: Big data and web intelligence
learning by doing --- yes! wrote my very first reducer for real life projects!
MongoDB (udacity) (NoSQL)
Spark/Scala - try this book: Advanced Analytics with Spark (very doable and easy to follow, superb examples)
For Scala, I recommend Coursera: functional programming in scala - super good
Spark MOOC http://www.1point3acres.com/bbs/thread-135600-2-1.html
Book: Learning Spark
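For anyone who has not yet written a reducer, the MapReduce idea fits in a few lines. This is a single-process sketch of the map, shuffle, and reduce phases for word count, not how Hadoop or Spark actually execute it; the function names are my own, and a real framework would distribute each phase across machines.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: collapse one key's values into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(mapper(line) for line in lines)
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data is big", "data is data"])
print(counts)
```

The mapper and reducer only ever see one line or one key at a time, which is what lets a framework run them in parallel.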
Basic Engineering: https://see.stanford.edu/Course It also has great content on optimization, which is harder to find elsewhere.
If you want to be a DS for IT firms, then maybe:
jquery/ajax (start from codecademy very simple js and jquery intro, then find books) w3c school one is also really good.
-----------------------------------------------------
web services: get a basic idea of how browsers work (udacity - Website Performance Optimization)
udacity web development (build a blog) (40 hours)
-----------------------------------------------------
SE
Software Development Life Cycles (udacity, mostly videos, as a quick intro only), amazingly, this one filled lots of holes in my knowledge base. Highly recommend
Also a book is mentioned here, worth a quick flip through, unfortunately, no ebook that I found works. Martin Fowler, Kent Beck, John Brant, William Opdyke, Don Roberts-Refactoring_ Improving the Design of Existing Code
-- this is helpful not only for working in IT, but helps overall coding style/efficiency as well. Wish I'd known earlier.
-----------------------------------------------------
Linux
Many servers run Linux. At least familiarize yourself with the command line. There's a not-so-good course on edX.
Basic shell script or similar
jq, sed, awk.
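-----------------------------------------------------
Synthesis / analysis / communication / soft skills
Soft skills are hard to put into words.
Technique is not the most important thing; thinking clearly before you speak is the key. I suddenly noticed that my advisor's lab page opens with exactly these questions, and it resonated deeply.
The ability to simplify and to communicate from a high vantage point: hide complex formula/engineering details and convey the big picture as much as possible.
In my experience, the best way to acquire these skills is to go and present. Don't just talk at people: keep checking whether the audience is following, encourage them to ask questions right away, and answer using keywords that fit their background rather than the ones you are familiar with. Avoid abbreviations and niche jargon. Explain the intuition more and pile up fewer formulas.
1. Teach an introductory course in your own field, e.g. a statistics student teaching intro statistics to people from other majors. Example: explain to someone who knows no statistics what p-value, power, false positive, randomization, inference, etc. are.
2. Consulting - some schools have this kind of session. Don't think of it as a waste of time. Go and make others understand; see what problems they tackle with your field's techniques, where their thinking differs from yours, how you can understand them, and how you can make them understand you.
3. Give presentations - don't present the way you would at a specialist academic conference; present as if you were teaching a 101 class. The goal is not to show how complex and profound your field is, nor to impress others with your technical prowess, but to make the audience understand and ultimately take your advice.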
Data Journalism (course, starting early 2014) --- it was not as good as I expected. I do not recommend it.
Charts: for static charts, ideally know ggplot (a few hours); for dynamic ones, d3. If you know javascript, also great! Recommended reading:
Nathan Yau: the books Visualize This & Data Points, and his FlowingData blog
for d3: Interactive Data Visualization for the Web. Free online tutorial by the author: http://alignedleft.com/tutorials/d3/about It's really not that hard.
Whether a chart looks pretty is not the point; choosing the right chart to help explain the idea matters more.
html (a few hours, w3c)
css (a few hours, w3c), or codecademy, or the d3 book mentioned above
javascript (codecademy as a start, a book to follow later)
Rcharts/highcharts
Udacity now also has a newly launched vis course.
Prototype your data products:
MEAN stack: https://thinkster.io/angulartutorial/mean-stack-tutorial/
At least learn AngularJS; it is useful beyond data science.
R OpenCPU. R Shiny (limited usage with free version). If you are not into Angular, try the flask+React stack; it really is quick to pick up.
(For flask, udacity has a course; React can be self-taught, and you can refer to udacity's course on components.)
We are not trying to become front-end developers, but it seems we need at least half-baked front-end skills. Please learn from this forum member's experience, it's excellent: http://www.1point3acres.com/bbs/thread-104335-1-1.html
Design: (optional but nice to know) If you are not interested, please at least look at (color combinations that look good together). If you are interested in making charts look good, spend a weekend flipping through these books:
1. Before and After
2. Nondesigner's design book
3. Don't make me think
4. The Wall Street Journal Guide to Information Graphics
Research/publication:
sharelatex (invite enough users to get free versioning) /writelatex.com
Go to conferences, see what people are working on. Read their papers.
If you are targeting a certain type of job, find the team members on LinkedIn and skim their papers.
Domain Knowledge: google/wikipedia is your friend
=========================================
Overall approach:
Doing Data Science (book)
Data Science in Business
=========================================
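Other: small things that I feel don't take much time but will be useful
excel, power pivot etc
Popular-science books (all simple and easy to read):
What on earth is big data??? "Big Data: A Revolution That Will Transform How We Live, Work, and Think"
and a very similar one, "Automate This: How Algorithms Took Over Our Markets, Our Jobs, and the World"
A quick flip through is enough.
And then of course there is Nate Silver's "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't"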
=========================================
Case study: Twitter data analytics http://tweettracker.fulton.asu.edu/tda/
=========================================
An MS data science study curriculum that someone recommended: http://datasciencemasters.org/
=========================================
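Tools people recommended to me for organizing your thinking and doing things the right way: It's more important than you think!!
http://software-carpentry.org/lessons.html
coursera reproducible research: get fluent with knitr, and don't copy-paste anything
Udacity Git Course (the best, bar none).
============================
Finally, nothing is more useful than doing the work yourself and getting feedback.
Data science is an apprenticeship model; find the right person to guide you through real work and you will grow quickly.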