Project interview: come share your opinions?

2014 (Apr-Jun) | Analytics / Data Science | Undergraduate | Internship @ - Online mass application - Other


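I posted earlier in the programming/algorithms board asking for help. I just did a rough job on it and submitted it, so I'm posting it here for everyone to discuss.... My input method is rather primitive, so I'll write in English..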

The HR contact sent me a project to evaluate my data science skills. She asked me not to post the project online, so I will only describe it in general terms for you guys to discuss...
There is a 1 GB data set. The first column is the Facebook user ID, and the remaining columns are that user's 'likes': basketball, baseball, walking dead, domino.... Just imagine you are on Facebook and you click 'like' on some public page or on something someone wrote. For example, I post "I got a girlfriend" and you click 'like' on my post... Different people have different numbers of likes, and the likes are not all the same. The data may look like:
1234455 basketball, olive garden,walking dead, i get a girl frend,really?you love me?,haha haha haha,....
1234667 data science, cs, statistics, weight training, dell, dota2,.......
There are 50,000 observations, the number of columns is not the same for each observation, and besides English there are Japanese and many other characters that my computer can't display....
1. Build a histogram of the likes
In my opinion, this just means counting the frequency of the different strings, and maybe we can ignore the low-frequency ones. The challenges for me are reading the data (since I only know R, I need a way to handle memory) and counting the frequencies (since I don't know how many distinct strings there will be).
2. Build a histogram of the like pairs, e.g. (stats, cs). If a person has 100 likes, there will be 100*99/2 possible pairs...
3. Given the training set and the likes of a new person, build a recommendation system that recommends something to that person
4. Given the training set and some item, like 'cs', find the people you should recommend it to..

This is a project interview from a startup. I just felt it's not the right time for me to take that position, because I really don't have any good solutions and I have finals and projects due at the end of this month, so I just sent a rough outline of my ideas to the HR contact.

I hope we can have a good discussion about it, and that it helps someone with future interviews...

PS: the data above is fake; I made it up myself.





Just opened the data set: 260,000 rows, 70 columns.

Ignore this, it's not correct. The data is a mess and I didn't read it correctly.
I guess the interview is from FB, and it's probably not a good idea to post it online....

1. Use a hash map or a trie. Memory is not an issue on production machines, as they have plenty of it. If they really care about memory, use a key-value DB and persist the data to disk/flash. You might need to think about how to process characters from different languages... also, some words may differ but mean the same thing in different languages.

2. Not sure I understand your question correctly: do you need to consider like pairs across people, or only like pairs within a person?

3. Just a basic idea: maybe maintain a mapping from each like to the set of user IDs that have it. Then for any two likes, if there is a large overlap between their ID sets, you can consider those two likes close, and you can recommend similar likes.

4. If you have 3, then 4 should be trivial.
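
To make points 3 and 4 concrete, here is a minimal Python sketch of the like-to-user-ID-set idea. Everything in it is an assumption for illustration: the toy `user_likes` data, the function names, and the choice of Jaccard overlap (intersection size divided by union size) as the measure of "huge overlap". It is one possible reading of the idea, not what the company expects.

```python
from collections import defaultdict

def build_index(user_likes):
    """Map each like to the set of user IDs that have it (an inverted index)."""
    index = defaultdict(set)
    for user, likes in user_likes.items():
        for like in likes:
            index[like].add(user)
    return index

def jaccard(a, b):
    """Overlap of two ID sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_likes(index, known_likes, top_n=3):
    """Task 3: rank unseen likes by their summed overlap with the known likes."""
    known = set(known_likes)
    scores = {}
    for cand, users in index.items():
        if cand in known:
            continue
        score = sum(jaccard(users, index[k]) for k in known if k in index)
        if score > 0:
            scores[cand] = score
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]

def recommend_users(index, user_likes, target, top_n=3):
    """Task 4: rank users who lack `target` by how close it is to their likes."""
    target_users = index.get(target, set())
    scores = {}
    for user, likes in user_likes.items():
        if target in likes:
            continue
        score = sum(jaccard(target_users, index[l]) for l in likes)
        if score > 0:
            scores[user] = score
    return sorted(scores, key=lambda u: (-scores[u], u))[:top_n]

# Toy data, made up for this example.
user_likes = {
    1: {"cs", "stats", "dota2"},
    2: {"cs", "stats"},
    3: {"cs", "basketball"},
    4: {"basketball", "olive garden"},
}
index = build_index(user_likes)
print(recommend_likes(index, ["stats"]))            # likes for a new person
print(recommend_users(index, user_likes, "stats"))  # people to show 'stats' to
```

With the real data, comparing every pair of likes this way may be too slow, so you would probably keep only the frequent likes first.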
danielgao posted on 2014-4-10 00:37
I guess the interview is from FB, and it's probably not a good idea to post it online....

1. us ...
It's from a startup, not FB..
I don't know hash maps or hash tables. Is there a quick way to learn and apply them?
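
For reference, a hash map is just a table from keys to values with fast lookup; in Python the built-in `dict` and `collections.Counter` are hash maps. Below is a rough sketch of task 1 that reads one line at a time, so only the counts live in memory, not the whole file. The line layout (ID, a space, then comma-separated likes), the lowercasing, and the name `count_likes` are my assumptions, since the real file format was not posted.

```python
from collections import Counter
import io

def count_likes(lines):
    """Count how many users have each like, reading one line at a time."""
    counts = Counter()  # hash map: like string -> number of users
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: "<user_id> <like>, <like>, ..."
        _, _, rest = line.partition(" ")
        # A set so that a like repeated on one line is counted once per user.
        likes = {s.strip().lower() for s in rest.split(",") if s.strip()}
        counts.update(likes)
    return counts

# Made-up sample standing in for the real file.
sample = io.StringIO(
    "1234455 basketball, olive garden, walking dead\n"
    "1234667 data science, cs, basketball\n"
)
hist = count_likes(sample)
print(hist["basketball"])  # 2
```

For the real file you would pass an open file object instead, e.g. `with open(path, encoding="utf-8") as f: hist = count_likes(f)`. Python 3 strings are Unicode, so Japanese and other characters should count like any other string as long as the encoding is right.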

The second task is, I think, to form the cross pairs for each person. Say my likes are a, b, c; then I need to find the frequency of the pair (a, b) among all the people.
This problem is a great fit for Spark.
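
A small sketch of that pair counting in Python, assuming each person's likes are already parsed into a list (the name `count_pairs` and the toy data are made up). Sorting each person's likes first means (a, b) and (b, a) land on the same key.

```python
from collections import Counter
from itertools import combinations

def count_pairs(user_likes):
    """Count, over all users, how often each unordered pair of likes co-occurs."""
    pair_counts = Counter()
    for likes in user_likes:
        unique = sorted(set(likes))  # dedupe and fix an order within the user
        # n likes produce n*(n-1)/2 pairs, e.g. 100 likes -> 4950 pairs.
        pair_counts.update(combinations(unique, 2))
    return pair_counts

users = [["a", "b", "c"], ["b", "a"], ["c", "d"]]
pairs = count_pairs(users)
print(pairs[("a", "b")])  # 2
```

The number of distinct pairs can grow much faster than the number of distinct likes, so on the full data it may be worth dropping the low-frequency likes before pairing.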
