1point3acres Forum (一亩三分地论坛)

Project Interview.....come to share opinions?

snowdustdj posted on 2014-4-9 06:47:11

2014 (Apr-Jun) | Analytics / Data Science | Undergraduate | Internship @ - Applied online - Other

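I posted in the programming/algorithms board earlier asking for help. I just did a rough job on it and submitted it, so I'm posting it here for everyone to discuss.... My input method is pretty primitive, so I'll write in English..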

HR sent me a project to evaluate my data science skills. She asked me not to post the project online, so I will only describe it in general terms for you guys to discuss....
There is a 1 GB data set. The first column is a Facebook user ID and the remaining columns are that user's 'likes': basketball, baseball, Walking Dead, domino.... Just imagine you are on Facebook and you click 'like' on some public page or on something someone wrote. For example, I post "I got a girlfriend" and you click 'like' on my post... Different people have different numbers of likes, and the likes are not all the same. The data may look like:
1234455 basketball, olive garden,walking dead, i get a girl frend,really?you love me?,haha haha haha,....
1234667 data science, cs, statistics, weight training, dell, dota2,.......
.......
There are 50,000 (5w) observations, and the number of columns is not the same for each observation. Besides English, there is Japanese and many other characters that my computer can't display.....
Objectives:
1. Build a histogram of the likes.
In my opinion, this is just counting the frequency of the different strings, and maybe we can ignore some low-frequency ones. The challenges for me are reading the data (because I only know R, I need a way to handle the memory) and counting the frequencies (because I don't know how many distinct strings there will be).
2. Build a histogram of the like pairs, e.g. (stats, cs). If a person has 100 likes, there are 100*99/2 possible pairs...
3. Given the training set and the likes of a new person, build a recommendation system that recommends something to that person.
4. Given the training set and some item, like 'cs', find the people you should recommend it to..
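
For task 1, one way to keep memory low is to read the file one row at a time and keep only a running tally instead of loading the whole 1 GB at once. Below is a minimal Python sketch of that idea (Python is used only for illustration). It assumes each row is the user ID, then whitespace, then comma-separated likes, as in the fake rows above. `parse_line`, `count_likes`, and the sample rows are made up for this example.

```python
import io
from collections import Counter


def parse_line(line):
    """Split one row into (user_id, [likes]).

    Assumed layout: the ID, then whitespace, then comma-separated likes.
    """
    parts = line.strip().split(None, 1)
    user_id = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    likes = [like.strip() for like in rest.split(",") if like.strip()]
    return user_id, likes


def count_likes(lines):
    """Count how often each like occurs, consuming one row at a time."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank rows
        _, likes = parse_line(line)
        counts.update(likes)
    return counts


# Tiny in-memory stand-in for the real file (made-up rows).
sample = io.StringIO(
    "1 basketball, olive garden, walking dead\n"
    "2 data science, cs, basketball\n"
    "3 cs, statistics, basketball\n"
)
like_counts = count_likes(sample)
top_likes = like_counts.most_common(2)
print(top_likes)  # [('basketball', 3), ('cs', 2)]
```

For the real file, passing an open file handle such as `open(path, encoding="utf-8", errors="replace")` in place of `sample` should behave the same way, since a file object also yields one line at a time. Note that a like which itself contains a comma would be split in two under this parsing assumption.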

This is a project interview from a startup. I felt it was not the right time for me to go for that position, because I really don't have any good solutions and I have finals and projects due at the end of this month, so I just sent HR a rough outline of my ideas.

I hope we can have a good discussion about it, and that it helps someone with future interviews...

PS: the data above is fake; I made it up myself.

snowdustdj (original poster) posted on 2014-4-9 06:50:03
Just opened the data set: 260,000 (26W) rows, 70 cols.
Addendum (2014-4-9 06:53):
Ignore this one....it's not correct. The data is a mess and I didn't read it correctly.

danielgao posted on 2014-4-10 00:37:16
I guess the interview is from FB, and it's probably not a good idea to post it online....

1. Use a hash map or a trie. Memory is not an issue on a production machine, as those machines have tons of memory. If they really care about memory, use a key-value DB and persist the data on disk/flash. You might also need to think about how to process characters from different languages... and some words may look different but mean the same thing in different languages.

2. Not sure if I understand your question correctly: do you need to consider like pairs across people, or only like pairs within a single person?

3. Just a basic idea: maybe maintain a mapping from each like to the set of user IDs that have it. For any two given likes, if there is a large overlap between their ID sets, you can consider those two likes close, and then you can recommend similar likes..
4. If you have 3, it should be trivial to do 4.
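
The like-to-user-ID mapping from point 3 could be sketched roughly as follows. This is only one possible reading of the idea: "overlap" is measured here with Jaccard similarity (shared users divided by total users), which is my assumption and not something the post specifies. The function names and the four sample users are made up for illustration.

```python
from collections import defaultdict


def build_index(user_likes):
    """Map each like to the set of user IDs that have it."""
    index = defaultdict(set)
    for user_id, likes in user_likes.items():
        for like in likes:
            index[like].add(user_id)
    return index


def jaccard(a, b):
    """Overlap of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_likes(index, known_likes, top_n=3):
    """Task 3: rank likes the person lacks by similarity to their likes."""
    known = set(known_likes)
    scores = {}
    for candidate, fans in index.items():
        if candidate in known:
            continue
        score = sum(jaccard(fans, index[k]) for k in known if k in index)
        if score > 0:
            scores[candidate] = score
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]


def recommend_users(index, user_likes, item, top_n=3):
    """Task 4: rank users who lack `item` by how close it is to their likes."""
    fans = index.get(item, set())
    scores = {}
    for user_id, likes in user_likes.items():
        if item in likes:
            continue
        score = sum(jaccard(fans, index.get(k, set())) for k in likes)
        if score > 0:
            scores[user_id] = score
    return sorted(scores, key=lambda u: (-scores[u], u))[:top_n]


# Made-up training data: user ID -> set of likes.
user_likes = {
    "u1": {"cs", "statistics", "dota2"},
    "u2": {"cs", "statistics"},
    "u3": {"cs", "basketball"},
    "u4": {"basketball", "olive garden"},
}
index = build_index(user_likes)
print(recommend_likes(index, ["statistics"]))        # ['cs', 'dota2']
print(recommend_users(index, user_likes, "dota2"))   # ['u2', 'u3']
```

Task 4 reuses the same index and similarity, which is presumably what "trivial" refers to. On the full data set the all-candidates loop would likely need pruning, for example by dropping rare likes first.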

snowdustdj (original poster) posted on 2014-4-10 02:24:33 from mobile
It's from a startup, not FB..
I don't know hash maps or hash tables. Is there a quick way to learn and apply them?
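
For reference, a hash map is just a table from keys to values with fast lookup; in Python the built-in `dict` plays that role. The sketch below counts strings with a plain `dict` and, picking up the earlier remark about different language characters, normalizes each key first so that case and Unicode variants of the same text collapse into one entry. `normalize_like` is a hypothetical helper, and NFKC plus case folding is only one reasonable choice; it does not address words that mean the same thing in different languages.

```python
import unicodedata


def normalize_like(text):
    """Fold case and Unicode variants so equivalent spellings share one key."""
    return unicodedata.normalize("NFKC", text).casefold().strip()


counts = {}  # a dict is Python's hash map: key -> value
raw_likes = [
    "Dota2",
    "dota2",
    "\uff24\uff2f\uff34\uff21\uff12",  # "DOTA2" typed in fullwidth characters
    "Caf\u00e9",                       # precomposed e-acute
    "Cafe\u0301",                      # plain e followed by a combining accent
]
for raw in raw_likes:
    key = normalize_like(raw)
    counts[key] = counts.get(key, 0) + 1  # look up, add one, store back

print(counts)
```

The five raw strings end up under two keys, so the frequency table is not inflated by spelling variants of the same like.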

The second task, I think, is to form the pairs within each person and count them across everyone. Say my likes are a, b, c: then I need the frequency of the pair (a, b) among all the people.
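
That reading of task 2 could be sketched like this: build every unordered pair within one person's likes, then tally the pairs over all people. The function name `count_pairs` and the three sample rows are made up for illustration. Sorting each person's likes first is an assumption that (a, b) and (b, a) should count as the same pair.

```python
from collections import Counter
from itertools import combinations


def count_pairs(rows):
    """Tally unordered like pairs; each person contributes each pair once."""
    pair_counts = Counter()
    for likes in rows:
        unique = sorted(set(likes))  # dedupe and fix the order within a pair
        pair_counts.update(combinations(unique, 2))
    return pair_counts


# Made-up likes for three people.
rows = [["a", "b", "c"], ["a", "b"], ["b", "c", "a"]]
pair_counts = count_pairs(rows)
print(pair_counts[("a", "b")])  # 3
print(pair_counts[("a", "c")])  # 2

# A person with 100 likes contributes 100*99/2 = 4950 pairs.
print(len(list(combinations(range(100), 2))))  # 4950
```

Because the number of distinct pairs grows roughly with the square of the number of distinct likes, it may be worth dropping low-frequency likes before pairing on the full data set.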

阿骄 posted on 2015-11-21 11:20:30
This problem is a great fit for Spark.