Project interview: come share your opinions?

snowdustdj posted on 2014-4-9 06:47:11

2014 (Apr-Jun) | Analytics / Data Science | Undergraduate | Internship @ - Online mass application - Other



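I asked for help with this earlier in the programming/algorithms board. I just did a quick job on it and submitted it, so I'm posting it here for everyone to discuss. My input method is pretty primitive, so I'll write in English.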

The HR contact sent me a project to evaluate my data science skills. She asked me not to post the project online, so I will only describe it in general terms for you guys to discuss...
There is a 1 GB data set. The first column is a Facebook user ID and the remaining columns are that user's 'likes': basketball, baseball, Walking Dead, Domino... Just imagine you are on Facebook and click 'like' on a public page or on something someone wrote. For example, I post "I got a girlfriend" and you click 'like' on my post. Different people have different numbers of likes, and the likes are not all the same. The data may look like:
1234455 basketball, olive garden,walking dead, i get a girl frend,really?you love me?,haha haha haha,....
1234667 data science, cs, statistics, weight training, dell, dota2,.......
There are 5w (50,000) observations, and the number of columns differs from row to row. Besides English there is Japanese, plus many other characters that my computer can't display....
1. Build a histogram of the likes.
In my opinion, this is just counting the frequency of the distinct strings, and maybe we can ignore the low-frequency ones. The challenge for me is reading the data (I only know R, so I need a way to handle memory) and counting the frequencies (I don't know in advance how many distinct strings there will be).
2. Build a histogram of the like pairs, e.g. (stats, cs). If a person has 100 likes, there are 100*99/2 = 4,950 possible pairs...
3. Given the training set and the likes of a new person, build a recommendation system that recommends something to that person.
4. Given the training set and some item, such as 'cs', find the people you should recommend it to.
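For task 1, the counting step can be sketched roughly as below, in Python rather than R. This is only a sketch under assumptions that the post does not confirm: each line is `<user_id>`, a space, then comma-separated likes; no like contains a comma; and the file is UTF-8. The function name `count_likes` and the sample rows are made up for illustration. Reading line by line means only the counter has to fit in memory, not the whole 1 GB file.

```python
import io
from collections import Counter

def count_likes(lines):
    """Count how many users have each like, reading one record at a time."""
    counts = Counter()
    for line in lines:
        # Assumed layout: "<user_id> <like>, <like>, ..."
        _user_id, _, rest = line.strip().partition(" ")
        # Lowercase and trim so "CS" and " cs" land in the same bucket;
        # the set also drops a like repeated within one user's row.
        likes = {like.strip().lower() for like in rest.split(",")}
        likes.discard("")  # empty fields left by trailing commas
        counts.update(likes)
    return counts

# Tiny made-up sample. For a real file, pass open(path, encoding="utf-8").
sample = io.StringIO(
    "1 basketball, olive garden, walking dead\n"
    "2 data science, cs, basketball\n"
    "3 cs, Basketball,\n"
)
hist = count_likes(sample)
print(hist.most_common(2))  # [('basketball', 3), ('cs', 2)]
```

Dropping the low-frequency strings afterwards is then a filter over `hist.items()`.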
This is a project interview from a startup. I just felt it's not the right time for me to take that position, because I really don't have any good solutions and I have finals and projects due at the end of this month, so I sent the HR contact a rough outline of my ideas.

I hope we can have a good discussion about it, and that it helps someone with future interviews...

PS: the data above are fake; I made them up myself.





OP | snowdustdj posted on 2014-4-9 06:50:03
Just opened the data set: 26w (260,000) rows, 70 columns.

Addendum (2014-4-9 06:53):
Ignore this, it's not correct. The data is a mess and I didn't read it correctly.

danielgao posted on 2014-4-10 00:37:16
I guess the interview is from FB, and it's probably not a good idea to post it online....

1. Use a hash map or a trie. Memory is not an issue on a production machine, as those machines have plenty of it. If they really care about memory, use a key-value DB and persist the data to disk/flash. You may also need to think about how to handle characters from different languages, and some words may differ across languages but mean the same thing.

2. Not sure I understand the question correctly: do you need to consider like pairs across people, or only like pairs within one person?

3. Just a basic idea: maintain a mapping from each like to the set of user IDs that have it. For any two likes, if their ID sets overlap heavily, you can consider the two likes close, and then recommend similar likes.

4. If you have 3, 4 should be trivial.
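The like-to-user-ID mapping from point 3 could look roughly like the sketch below. Jaccard overlap is my assumption for measuring "huge overlap"; the reply does not name a measure. All function names and the sample data are made up for illustration.

```python
from collections import defaultdict

def build_index(likes_by_user):
    """Map each like to the set of user ids that have it."""
    index = defaultdict(set)
    for user, likes in likes_by_user.items():
        for like in likes:
            index[like].add(user)
    return index

def jaccard(a, b):
    """Overlap of two id sets: |a & b| / |a | b| (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_likes(index, known_likes, top_n=3):
    """Task 3: score each like the person lacks by overlap with known likes."""
    known = set(known_likes)
    scores = {}
    for candidate, users in index.items():
        if candidate in known:
            continue
        score = sum(jaccard(users, index[k]) for k in known if k in index)
        if score > 0:
            scores[candidate] = score
    # Highest score first; ties broken alphabetically.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]

def recommend_users(index, like, top_n=3):
    """Task 4: rank users who lack `like` by their likes' similarity to it."""
    have = index.get(like, set())
    scores = defaultdict(float)
    for other, users in index.items():
        if other == like:
            continue
        sim = jaccard(have, users)
        if sim == 0:
            continue
        for user in users - have:
            scores[user] += sim
    return sorted(scores, key=lambda u: (-scores[u], u))[:top_n]

# Made-up training data.
index = build_index({
    "u1": {"cs", "stats", "dota2"},
    "u2": {"cs", "stats"},
    "u3": {"cs", "dota2"},
    "u4": {"basketball", "stats"},
})
print(recommend_likes(index, ["stats"]))  # ['cs', 'basketball', 'dota2']
print(recommend_users(index, "cs"))       # ['u4']
```

Task 4 reuses the same index with the roles swapped, which is presumably what point 4 means by "trivial". On the real data, comparing every like against every other like gets expensive, so rare likes would probably need pruning first.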

OP | snowdustdj posted on 2014-4-10 02:24:33 from mobile
danielgao posted on 2014-4-10 00:37
I guess the interview is from FB, and it's probably not a good idea to post it online....

1. us ...
It's from a startup, not FB..
I don't know hash maps or hash tables. Is there a quick way to learn and apply them?
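As a quick illustration of the idea: a hash map stores key -> value pairs and can look up or update a key in roughly constant time on average, so you don't need to know the set of keys in advance. In Python the built-in `dict` is a hash map; in R, an environment created with `new.env(hash = TRUE)` can play a similar role. The list of likes below is made up.

```python
# Count strings with a plain dict, without knowing the keys up front.
counts = {}
for like in ["cs", "stats", "cs", "dota2", "cs"]:
    # get() returns 0 the first time a key is seen.
    counts[like] = counts.get(like, 0) + 1

print(counts)             # {'cs': 3, 'stats': 1, 'dota2': 1}
print(counts["cs"])       # 3
print("chess" in counts)  # False
```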
The second task, I think, is to find the pairs within each person's likes and count them across everyone. Say my likes are a, b, c; then I need to find the frequency of the pair (a, b) among all the people.
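Under that reading of task 2, the pair count can be sketched as follows. The function name `count_pairs` and the sample data are made up, and I am assuming that pairs are unordered, so (a, b) and (b, a) count as the same pair.

```python
from collections import Counter
from itertools import combinations

def count_pairs(likes_by_user):
    """Count, over all users, how many users have both likes of each pair."""
    pair_counts = Counter()
    for likes in likes_by_user.values():
        # Sorting makes (a, b) and (b, a) the same key; set() drops repeats.
        for pair in combinations(sorted(set(likes)), 2):
            pair_counts[pair] += 1
    return pair_counts

# Made-up sample: three users and their likes.
pairs = count_pairs({
    "me": ["a", "b", "c"],
    "u2": ["b", "a"],
    "u3": ["c", "d"],
})
print(pairs[("a", "b")])  # 2
print(pairs[("c", "d")])  # 1
```

A user with n likes contributes n*(n-1)/2 pairs, so on the full data the counter can grow much larger than the single-like histogram. Dropping low-frequency likes before pairing is one way to keep it manageable.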

阿骄 posted on 2015-11-21 11:20:30
This problem is a great fit for Spark.