For my machine learning study log, please see here.
====================
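Data science, simply put: don't reach conclusions by gut feeling; base them on data and let the facts speak.
The skill set in three words: statistics, programming, communication. A PhD Data Scientist: jack of all trades, master of one.
If you have the following on your résumé (and in your head), you should have little trouble finding a job:
t-test, Regression, ANOVA, Logistic Regression, DOE, Machine Learning, Data Mining, MapReduce, SQL, R/Matlab, Python, Java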
=========================================
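This post is mainly about doing data science in the IT industry. It does not define a data engineer. Rather, it comes close to describing a "full-stack data scientist". Master this list and you will be able to work not only for established firms, but for startups too.
For roles that lean toward traditional-industry applications, the bar for communication is somewhat higher and the bar for the other skills somewhat lower.
Before an interview, be sure to spend a week learning the basics of the company's industry. Wikipedia is enough; at the very least, get familiar with the industry's common keywords.
If your goal is simply a decent job, settle down and work through the list.
If you want to do really well, stand out in at least one of the three areas.
Learning all of this takes a lot of time. If you hope to become a data scientist without much effort: OK, dream on!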
Bayesian
Bayesian Methods for Hackers (Python)
Coursera Graphical Model (VERY nicely explained)
Bayesian Reasoning and Machine Learning book (quite difficult to read)
Intro: A First Course in Bayes. A quick read, and quite good.
longitudinal, mixed model
DOE: all kinds of designs, response surface.
(?)survival
Machine Learning
Coursera, Andrew Ng
Stanford Statistical Learning (Tibshirani & Hastie). The book also has an undergraduate edition that emphasizes hands-on practice, with lots of R; very easy to read. I recommend starting here.
I couldn't keep up with Caltech's Learning from Data.
Please make sure you know your logistic regression inside and out!
Deep Learning: See my separate thread here: http://www.1point3acres.com/bbs/ ... 1&extra=#pid2601595
Learn recommender systems
Learn some NLP
Make sure you KNOW how things work, not just how to call a certain package in a certain language!!!
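To make "know how things work" concrete for logistic regression, here is a minimal from-scratch sketch using only the standard library. It is an illustration under my own assumptions (batch gradient descent on the mean log loss, a hypothetical one-feature toy dataset, arbitrary learning rate and epoch count), not how any particular package implements it.

```python
import math

def sigmoid(z):
    # Numerically stable form: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, w, b):
    # P(y = 1 | x) under the model: sigmoid of the linear score.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, n_features = len(xs), len(xs[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(xs, ys):
            err = predict_proba(x, w, b) - y  # d(log loss)/d(score)
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical toy data: the label flips from 0 to 1 as the feature grows.
xs = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(w, b, predict_proba([0.0], w, b), predict_proba([5.0], w, b))
```

The gradient of the log loss with respect to the linear score is simply (predicted probability - label), which is why the update loop is so short. Being able to derive that line is the kind of understanding interviewers tend to probe.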
Experimental Design / Causal Inference
This is somewhat of a niche area. But as a DS you will most likely deal with some A/B tests if you are with a reputable internet company. It is not just a matter of using some tool to compute power for a chi-square test or t-test. Be sure you know the difference between an observational study and a designed experiment, and when to use which. Students from a biostat/epi background will have an edge here. If you can handle very complex experimental designs, you open many doors: think multi-sided marketplaces with interfering subjects (Uber/Lyft, Airbnb, eBay), social network problems (Snap/FB/LinkedIn), and problems where users can't be cleanly randomized (opt-in, marketing campaigns, mobile app feature rollouts).
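The gap between an observational study and a designed experiment can be seen in a toy simulation. Everything here is hypothetical: the hidden "engagement" variable, the opt-in rule, the effect size of 1.0, and the sample size are assumptions chosen only to show how self-selection biases a naive comparison while randomization does not.

```python
import random

def simulate(n=20000, true_effect=1.0, seed=0):
    """Return (observational estimate, randomized estimate) of a feature's effect."""
    rng = random.Random(seed)

    def outcome(engagement, treated):
        # Engagement drives the outcome regardless of treatment (a confounder).
        return 2.0 * engagement + true_effect * treated + rng.gauss(0, 1)

    observational = {0: [], 1: []}
    randomized = {0: [], 1: []}
    for _ in range(n):
        engagement = rng.random()  # hidden confounder, uniform on [0, 1)
        # Observational: more engaged users are more likely to opt in.
        opted_in = 1 if rng.random() < engagement else 0
        observational[opted_in].append(outcome(engagement, opted_in))
        # Designed experiment: a coin flip decides, independent of engagement.
        assigned = 1 if rng.random() < 0.5 else 0
        randomized[assigned].append(outcome(engagement, assigned))

    def mean(values):
        return sum(values) / len(values)

    return (mean(observational[1]) - mean(observational[0]),
            mean(randomized[1]) - mean(randomized[0]))

obs_estimate, rct_estimate = simulate()
print(f"observational: {obs_estimate:.2f}, randomized: {rct_estimate:.2f}")
```

In this setup the opt-in comparison overstates the effect because opted-in users were more engaged to begin with, while the randomized comparison lands near the true value of 1.0. Real opt-in and rollout problems need proper causal inference methods; this only shows why the distinction matters.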
Optimization & More
Intro to linear programming: https://www.math.ucla.edu/~tom/LP.pdf Good and easy read.
See see.stanford for additional courses on convex opt.
Prof. Ferguson also has some good reading material on game theory: https://www.math.ucla.edu/~tom/Game_Theory/
Udacity Intro to AI is a great course (also one of the very first MOOCs) that connects many concepts together, including particle filters, Kalman filters, HMMs, etc.
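Statistical Computing: R/Matlab/Python. SAS(?)
Industry largely treats R and Matlab as equivalent. However, Matlab is not free, and Octave is free but not as pleasant to use. Please consider teaching yourself R; if you already know Matlab, picking up R takes no time at all.
If SAS Base/Stat is the only language you know and you don't want to learn anything else, then data science may not be for you. If you absolutely must use SAS, you should at least have written macros. SAS is indeed very useful for modeling on large data, but it differs a lot from what other fields use; if everyone else on the team uses R/Python/Java, you will find it extremely hard to communicate with them. The software is also expensive, and many places may not be willing to buy it.
Note: my point is that knowing SAS is a good thing, but you cannot know only SAS.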
Python: Data Analysis with Python (book), pandas
R: data.table, or plyr, lubridate, reshape2; build an R package. There are now lots of such courses on both Udacity and Coursera. Start from any.
Know how to get data from any source (DB, web, XML, plain text, etc.).
EDA (exploratory) - Descriptive stats udacity
Inference - udacity
Plot/explain
read code from your favorite packages
-----------------------------------------------------
Programming: a compiled language and a scripting language.
Python
I prefer Udacity's approach of teaching and quizzing as it goes; with exercises only and no lectures (Codecademy), I don't seem to learn things properly.
Udacity CS101
Udacity CS215 (Algorithms; easier than the Coursera Princeton and Stanford courses, good for a quick pass)
Udacity (Peter Norvig) CS212 Design of Computer Programs: excellent, strongly recommended.
Java, data structures and algorithms
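1. Udacity Java (I spent 40 hours on this course). Suitable for people who don't even know what a function or an assignment is.
2. Data structures: I recommend treating this as a must.
Python: Problem Solving with Algorithms and Data Structures
Java: Berkeley 61B http://www.cs.berkeley.edu/~jrs/61b/
The textbooks are Head First Java and Data Structures and Algorithms in Java.
My progress bar: week 5, lab1, hw1.
3. Algorithms: Udacity's Algo in Python is fairly laid back; take it if you don't want to work too hard, though it's better to take the subject seriously.
Java: Coursera Algorithms I & II (Princeton), if you are interested in the topic.
Language-agnostic: Stanford Algorithms I & II is also very good; the two cannot substitute for each other.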
Topics covered include, but are not limited to:
floating-point overflow
edge cases
improving a MapReduce algorithm (beyond brute force)
If big data is involved, the bar for time complexity is higher
Binary search, and be prepared to talk about complexity
very basic DFS/BFS
reservoir sampling
string manipulation
if DP (dynamic programming) is ever asked, it will be very basic
basic data structures
Most likely you don't need LeetCode hard
-- I'll add more as I think of it
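Of the topics listed above, reservoir sampling is the one people most often haven't seen before the interview. Here is a minimal sketch of the classic single-pass version (often called Algorithm R); the function name and the optional rng parameter are my own choices for illustration.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Pick k items uniformly at random from an iterable of unknown length.

    Uses O(k) memory and a single pass, which is why it shows up in
    big-data interview questions.
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 5, random.Random(42))
print(sample)
```

Be prepared to argue why every item ends up in the sample with probability k/n: an induction on the stream length is the usual route.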
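Small things to pick up along the way: Regex (a couple of hours) http://deerchao.net/tutorials/regex/regex.htm
SQL: how deep do interviews go?
If the JD is for a DS, I personally haven't seen anything particularly nasty, but you definitely need to know:
JOIN is often faster than a subquery (this depends on the engine)
COUNT DISTINCT
WHERE
GROUP BY
HAVING and the like
advanced: you need to know windowing functions
Any link you find by googling SQL interview prep will cover these.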
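To practice the clauses listed above without setting up a database server, one option is Python's built-in sqlite3 module. The tables, columns, and rows below are hypothetical, made up only to exercise JOIN, WHERE, COUNT(DISTINCT ...), GROUP BY, and HAVING in one query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
INSERT INTO users VALUES (1, 'US'), (2, 'US'), (3, 'CA'), (4, 'CA');
INSERT INTO orders VALUES
    (10, 1, 20.0), (11, 1, 35.0), (12, 2, 5.0), (13, 3, 50.0), (14, 3, 2.0);
""")

# Countries with at least two distinct buyers, counting only orders >= 5.0.
query = """
SELECT u.country,
       COUNT(DISTINCT o.user_id) AS buyers,
       SUM(o.amount)             AS revenue
FROM users AS u
JOIN orders AS o ON o.user_id = u.user_id
WHERE o.amount >= 5.0                       -- filters rows before grouping
GROUP BY u.country
HAVING COUNT(DISTINCT o.user_id) >= 2       -- filters groups after aggregation
ORDER BY u.country
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('US', 2, 60.0)]
```

The distinction interviewers like to check is WHERE versus HAVING: the first drops individual rows before aggregation, the second drops whole groups afterwards.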
Spark/Scala - try this book: Advanced Analytics with Spark (very doable and easy to follow, superb examples)
For Scala I recommend Coursera: Functional Programming in Scala. Superb.
Basic Engineering: https://see.stanford.edu/Course It also has great content on optimization, which is harder to find elsewhere.
If you want to be a DS for IT firms, then maybe:
jQuery/AJAX (start from Codecademy's very simple JS and jQuery intro, then find books); the W3Schools one is also really good.
-----------------------------------------------------
Web services: get a basic idea of how browsers work (Udacity, Website Performance Optimization); Udacity web development (build a blog) (40 hours)
-----------------------------------------------------
SE
Software Development Life Cycles (Udacity, mostly videos, as a quick intro only). Amazingly, this one filled lots of holes in my knowledge base. Highly recommended.
A book is also mentioned there and is worth a quick flip through; unfortunately, none of the ebooks I found work. Martin Fowler, Kent Beck, John Brant, William Opdyke, Don Roberts: Refactoring: Improving the Design of Existing Code
-- This is helpful not only for working in IT; it helps overall coding style and efficiency as well. I wish I'd known earlier.
-----------------------------------------------------
Linux
Many servers run Linux. At least familiarize yourself with the command line. There's a not-so-good course on edX.
Basic shell script or similar
jq, sed, awk
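-----------------------------------------------------
Synthesis / analysis / communication / soft skills
Soft skills are hard to put into words.
Technique is not the most important thing; thinking clearly before you speak is the key. I suddenly noticed that my advisor's lab page opens with exactly these questions, and it resonated deeply.
The ability to simplify and communicate from a high level: hide complex formulas and engineering details, and convey the big picture as much as possible.
In my experience, the best way to acquire these skills is to go and talk. Don't just talk at people: keep checking whether the audience follows, encourage them to ask questions right away, and answer with keywords that fit their background rather than the ones you are familiar with. Avoid abbreviations and niche jargon. Explain the intuition; don't pile up formulas.
1. Teach an intro course in your own field. E.g., if you are a statistics student, teach intro statistics to people from other majors. Example: explain to someone who knows no statistics at all what a p-value, power, false positive, randomization, inference, etc. are.
2. Consulting: some schools offer this kind of session. Don't think of it as a waste of time. Go make others understand; see what problems others tackle with your specialty, where their thinking differs from yours, how you can understand them, and how to make them understand you.
3. Give presentations: don't talk the way you would at an academic conference; talk as if you were teaching a 101 class. The goal is not to show how complex and profound your field is, nor to impress others with your technical prowess, but to make the audience understand and, ultimately, take your advice.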
Data Journalism (course, starting early 2014) --- it was not as good as I expected. I do not recommend it.
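Plotting: for static plots, ideally know ggplot (a few hours); for interactive ones, d3. If you know JavaScript, also great! Recommended reading:
Nathan Yau: the books Visualize This and Data Points, and his FlowingData blog
For d3: Interactive Data Visualization for the Web. Free online tutorial by the author: http://alignedleft.com/tutorials/d3/about It really isn't that hard.
Whether a chart looks pretty is not the point; choosing the right chart to help explain your reasoning matters more.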
html (a few hours, w3c)
css (a few hours, w3c), or codecademy, or the d3 book mentioned above
javascript (codecademy as a start, a book to follow later)
Rcharts/highcharts
Udacity now has a newly launched vis course as well.
Prototype your data products:
mean stack. https://thinkster.io/angulartutorial/mean-stack-tutorial/
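At least learn AngularJS; it is useful beyond data science.
R OpenCPU. R Shiny (limited usage with the free version). If you are not into Angular, try the Flask + React stack; it really is quick to pick up.
(For Flask, Udacity has a course; React you can teach yourself, with Udacity's course on components as a reference.)
We are not trying to become front-end developers, but it seems we need at least half-baked front-end skills. Please learn from this poster's experience, it's excellent: http://www.1point3acres.com/bbs/thread-104335-1-1.html
Design: (optional but nice to know)
If you are not interested, please at least look at (colors that look good together).
If you are interested in making your charts look good, spend a weekend flipping through these books: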
1. Before and After
2. Nondesigner's design book
3. Don't make me think
4. The Wall Street Journal Guide to Information Graphics
Research/publication:
sharelatex (invite enough users to get free versioning) /writelatex.com
Go to conferences, see what people are working on. Read their papers.
If you are after a certain type of job, find the team members on LinkedIn and skim their papers.
Domain Knowledge: google/wikipedia is your friend
=========================================
Overall approach:
Doing Data science (book)
Data Science in Business
=========================================
Other: small things that I feel don't take much time but come in handy
Excel, Power Pivot, etc.
Popular-science books (all simple and easy to read):
CS189 or equivalent is a prerequisite for the course. This course will assume some familiarity with reinforcement learning, numerical optimization and machine learning. Students who are not familiar with the concepts below are encouraged to brush up using the references provided right below this list. We’ll review this material in class, but it will be rather cursory.
Reinforcement learning and MDPs
Definition of MDPs
Exact algorithms: policy and value iteration
Search algorithms
Numerical Optimization
gradient descent, stochastic gradient descent
backpropagation algorithm
Machine Learning
Classification and regression problems: what loss functions are used, how to fit linear and nonlinear models
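Already pivoting away from DS, toward data infra / data systems
The next topic I'm looking at is no longer in the data science / analytics domain, but it is also not something any random SWE knows. Since it is still mainly about data, I'm keeping it in this forum. This post is different from my earlier ones, and I don't recommend that DS people follow it blindly. If you are business-oriented, you don't need any of this at all. Analysts don't need it either.
=================
Why?
After working in DS/DA for quite a while, I feel I have a well-integrated grasp of basically everything except DL, and I'd like to grow more broadly on the technical side. If I dig any deeper, I may be cornering myself into a narrow niche with few patrons: the places I could move to would be countable on one hand, and I might not even want to go to them.
Also I have always been interested in seeing connections between things.
In practice, my strongest impression is that the method often doesn't need to be very deep; the real hurdles are usually the messy odds and ends, such as upstream data quality and speed, or the scenarios and speed (?) of downstream use of the data in the business.
Learning what the engineers around you are doing helps with collaboration, helps you take part in data system design, and helps you design systems and use cases that others haven't thought of. The people around me who understand the data and its business use often don't understand systems and implementation at all, while among those who understand systems and implementation, probably no one understands the data better than I do. This gap in the middle also affects how things get done. Bridging the gap should be a lot of fun.
For a recent project I ended up asking people from econ, statistics, CS, EE, OR, and ML. The same basic ideas go by different names and have different applications in each field, which is fascinating.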
2. data infra/ data sys /data eng/ building data products
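Again a legendary book, plus some Coursera courses that may be worth evaluating.
Mostly this is about practice; I will certainly be doing this kind of work over the coming year.
3. deep learning
http://www.1point3acres.com/bbs/thread-200846-1-1.html
Andrew Ng's intro DL, especially courses 3, 4, and 5, is still well worth taking.
Next I plan to get hands-on with PyTorch via fast.ai.
I'm sure I'll find good applications at work; I'll post them when I do.
There are a great many big data books, so many that it feels impossible to get through them all. I read the Hadoop elephant book back in the day, but never felt I understood it very clearly.
I've also read both Spark books, and again didn't feel I fully understood them.
I haven't had time to read Kafka: The Definitive Guide yet.
Trying to design ML systems without understanding these systems feels like a joke; you waste endless time repeating mistakes others made long ago.
If I still have time: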
Intermediate: Coursera Cloud computing (5-10hr/week X 5 week X 5 courses, + 2 capstone projects)
Course 1 covers a ton of classic concepts that I see day in and day out without grokking
(C++)
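There is a huge variety of Data Science MOOCs online now, and with so many choices it is hard to know where to start. People switching careers want to get the most understandable and practical knowledge in the shortest time. As one of them I know the feeling well, so here is what I've learned from my own experience.
As I see it, Data Science/Analytics broadly requires the following skills:
1. SQL and database skills
This is the first step of any data analysis: getting the data. The vast majority of data is stored in databases, so SQL skills are crucial; in fact, SQL will also take up most of your working time later on.
SQL is not hard, but memorizing a few keywords such as select, from, where, and group by is far from enough to become fluent quickly. The best way to practice is to see the results as you write, so that you understand what each statement actually does to the data behind the scenes and what its logic is.
SQL is also a key focus of data analysis interviews. Big tech companies such as Google, Facebook, Uber, and Slack all test it heavily. They don't need you to know fancy commands, but they will ask you to implement fairly complex logic using simple statements. Entry-level resources here include SQLZOO and the SQL section of W3Schools. Both are quick to get started with, and both let you write SQL and see the query results as you go, as described above, which shows you exactly what a statement does to the data itself.
A more advanced resource is Microsoft's MOOC on edX, Querying with Transact-SQL. It also suits beginners, but takes longer because it goes deeper (for example, window functions and table expressions).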
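For a feel of the two deeper topics mentioned, window functions and table expressions, here is a small sketch run through Python's sqlite3 module. It uses SQLite syntax rather than Transact-SQL, assumes the bundled SQLite is version 3.25 or newer (when window functions were added), and the sales table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (store TEXT, day TEXT, amount REAL);
INSERT INTO sales VALUES
    ('A', '2024-01-01', 10.0), ('A', '2024-01-02', 30.0), ('A', '2024-01-03', 20.0),
    ('B', '2024-01-01', 5.0),  ('B', '2024-01-02', 15.0);
""")

# The WITH clause is a common table expression; the OVER clauses are window
# functions: they compute per-row values without collapsing rows like GROUP BY.
query = """
WITH ranked AS (
    SELECT store, day, amount,
           SUM(amount) OVER (PARTITION BY store ORDER BY day)         AS running_total,
           RANK()      OVER (PARTITION BY store ORDER BY amount DESC) AS amount_rank
    FROM sales
)
SELECT store, day, amount, running_total
FROM ranked
WHERE amount_rank = 1      -- each store's best day, with the total up to that day
ORDER BY store
"""
best_days = conn.execute(query).fetchall()
for row in best_days:
    print(row)
```

A window function cannot be used directly in WHERE, which is why the ranking is computed in the table expression first and filtered in the outer query. The same pattern carries over to T-SQL.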
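2. Basic principles of statistics
Most traditional machine learning algorithms come from statistics, and statistical knowledge is also used heavily in the data exploration stage (Exploratory Data Analysis) and in all kinds of statistical testing at work.
This is traditional statistics; just pick accessible introductory statistics courses from well-known universities.
3. Data Science/Machine Learning Modeling
This area has the most courses but is also the hardest to choose from, because many courses are either too theory-heavy, requiring a strong math background, or comparatively too simple. Below are courses I think are quite good, balancing theoretical depth, ease of understanding, and hands-on practice.
Udemy: Python for Data Science and Machine Learning Bootcamp
If you're not much interested in the theoretical, mathematical side and only care about applying ML algorithms in practice, this is a very good introductory course.
This legendary course from MIT introduces each algorithm through a real-world case study and explains it in plain terms; the language used throughout the course is also easy to pick up.
Udacity: Intro to Machine Learning
This course is taught by Sebastian Thrun, founder of the Google X lab (and also the founder of Udacity). It covers the mainstream ML algorithms comprehensively; each new algorithm comes with many small exercises interspersed to help consolidate what you have just learned, and Sebastian, a leading figure in the field, explains ML clearly and plainly.
Stanford Online: Statistical Learning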
The difference between the file system on Linux and on HDFS!
Even if there is a local directory called data, you still need to create one on HDFS:
hadoop fs -mkdir data
hadoop fs -ls
Found 1 items
drwxr-xr-x - training supergroup 0 2013-12-11 17:16 data
Now there is an HDFS folder called data.
Now put the actual data into HDFS:
hadoop fs -put purchases.txt data (the first argument, purchases.txt, is the file on your local filesystem; the second, data, is the HDFS folder)
Then you can check that it is there:
hadoop fs -ls data
Found 1 items
-rw-r--r-- 1 training supergroup 211312924 2013-12-11 17:17 data/purchases.txt
Run:
hs ../code/mapper.py ../code/reducer_f2.py data/purchases.txt outdata2
packageJobJar: [../code/mapper.py, ../code/reducer_f2.py, /tmp/hadoop-training/hadoop-unjar8573115774818496995/] [] /tmp/streamjob8981780528938293292.jar tmpDir=null
13/12/11 17:33:16 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/12/11 17:33:17 WARN snappy.LoadSnappy: Snappy native library is available
13/12/11 17:33:17 INFO snappy.LoadSnappy: Snappy native library loaded
13/12/11 17:33:17 INFO mapred.FileInputFormat: Total input paths to process : 1
13/12/11 17:33:17 INFO streaming.StreamJob: getLocalDirs(): [/var/lib/hadoop-hdfs/cache/training/mapred/local]
13/12/11 17:33:17 INFO streaming.StreamJob: Running job: job_201312111650_0004
13/12/11 17:33:17 INFO streaming.StreamJob: To kill this job, run:
13/12/11 17:33:17 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201312111650_0004
13/12/11 17:33:17 INFO streaming.StreamJob: Tracking URL: http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201312111650_0004
13/12/11 17:33:18 INFO streaming.StreamJob: map 0% reduce 0%
13/12/11 17:33:30 INFO streaming.StreamJob: map 12% reduce 0%
13/12/11 17:33:33 INFO streaming.StreamJob: map 19% reduce 0%
13/12/11 17:33:36 INFO streaming.StreamJob: map 26% reduce 0%
13/12/11 17:33:40 INFO streaming.StreamJob: map 32% reduce 0%
13/12/11 17:33:43 INFO streaming.StreamJob: map 40% reduce 0%
13/12/11 17:33:46 INFO streaming.StreamJob: map 47% reduce 0%
13/12/11 17:33:49 INFO streaming.StreamJob: map 50% reduce 0%
13/12/11 17:34:01 INFO streaming.StreamJob: map 75% reduce 0%
13/12/11 17:34:02 INFO streaming.StreamJob: map 81% reduce 17%
13/12/11 17:34:05 INFO streaming.StreamJob: map 88% reduce 17%
13/12/11 17:34:08 INFO streaming.StreamJob: map 95% reduce 25%
13/12/11 17:34:11 INFO streaming.StreamJob: map 100% reduce 25%
13/12/11 17:34:17 INFO streaming.StreamJob: map 100% reduce 69%
13/12/11 17:34:20 INFO streaming.StreamJob: map 100% reduce 75%
hadoop fs -cat outdata1/part-00000
Baby 57491808.44
Books 57450757.91
Get data out from HDFS to local:
ls
code data
[training@localhost udacity_training]$ mkdir outdata
[training@localhost udacity_training]$ cd outdata/
[training@localhost outdata]$ hadoop fs -get outdata2a/part-00000
[training@localhost outdata]$ ls
part-00000
For quick tests, make some sample data (I just copied 20 lines from ~/udacity_training/data/purchases.txt). Save it as sampleData.txt in your code directory.
head -40 purchases.txt > sample.txt
Then in a terminal, in the code directory, you can run
./mapper.py <sampleData.txt >mappedData.txt
and then
./reducer.py <mappedData.txt
For me it's more like this:
python ./mapper_f3a.py <../data/sample.txt >../data/mappedData.txt
python ./reducer_f3a.py <../data/mappedData.txt
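The mapper and reducer scripts themselves are not shown in these notes, so here is a rough sketch of what a sales-per-category pair could look like, written as plain functions so the whole pipeline runs in one file. The six tab-separated fields (date, time, store, category, cost, payment) are my assumption about the layout of purchases.txt, and the sample rows are invented.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'category<TAB>cost' for each well-formed purchase record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # skip malformed lines instead of crashing the job
        # Assumed layout: date, time, store, category, cost, payment.
        _date, _time, _store, category, cost, _payment = fields
        yield f"{category}\t{cost}"

def reducer(sorted_lines):
    """Sum cost per category; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for category, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(float(cost) for _, cost in group)
        yield f"{category}\t{total:.2f}"

# Invented sample rows standing in for a few lines of purchases.txt.
sample = [
    "2012-01-01\t09:00\tSan Jose\tBooks\t10.50\tVisa",
    "2012-01-01\t09:01\tFort Worth\tBaby\t20.00\tCash",
    "2012-01-01\t09:02\tSan Diego\tBooks\t4.50\tAmex",
    "malformed line",
]
mapped = sorted(mapper(sample))  # sorted() stands in for Hadoop's shuffle/sort
totals = list(reducer(mapped))
print("\n".join(totals))
```

On the cluster, Hadoop sorts the mapper output by key before it reaches the reducer. When testing locally with files or pipes, a reducer written this way needs a sort step between the two scripts (for example `./mapper.py < sample.txt | sort | ./reducer.py`), otherwise the same category can show up in several partial sums.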