
Web scraping and exploratory analysis of data scientist job ads

DL, posted 2017-6-10 23:36:13

Here are the web-scraping scripts and some analysis results for data scientist job ads. I hope they are useful and serve as a starting point for others to build on.
1. Web scraping of kaggle jobs
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError
import re
import string
from bs4 import BeautifulSoup
import pymysql
import nltk
from nltk.corpus import stopwords
from time import sleep  # to avoid overwhelming the server

stop_words = set(stopwords.words('english'))
stop_words.update(set(string.punctuation))

# Open the database connection and get a cursor
db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()

#cursor.execute('''DROP TABLE IF EXISTS kaggle''')
cursor.execute('''CREATE TABLE IF NOT EXISTS kaggle
    (id INT NOT NULL, Company VARCHAR(150), Title VARCHAR(150),
    Location VARCHAR(150), Contents VARCHAR(3000),
    Created VARCHAR(20), Reviews INT, Link VARCHAR(50),
    PRIMARY KEY(id))''')

# Skip job ids that are already in the table
cursor.execute('''SELECT id FROM kaggle''')
data = cursor.fetchall()
jobid_visited = set([d[0] for d in data])

jobId_start = 16800
jobId_end = 17900

home_url = "https://www.kaggle.com/jobs/"

for job_id in range(jobId_start, jobId_end):
    if job_id % 50 == 0:
        print(job_id)
        sleep(1)
    if job_id in jobid_visited:
        continue
    job_link = "%s%s" % (home_url, job_id)
    try:
        html = urlopen(job_link)
    except HTTPError as e:
        print(job_id, e)
        continue
    soup = BeautifulSoup(html.read(), "lxml")
    job_title = soup.find('div', attrs={'class': 'title'})
    title = job_title.h1.getText()
    company = job_title.h2.getText()
    location = job_title.h3.getText()
    submission = soup.find('p', attrs={'class': 'submission-date'})
    submission_date = submission.span['title'].split()[0]
    reviews = submission.contents[2]
    reviews = int(''.join(list(filter(str.isdigit, reviews))))

    contents = submission.next_siblings  # the job-description tags
    word_set = set()
    for para in contents:
        if para == '\n':
            continue

        if isinstance(para, str):
            text = para.strip()
        else:
            text = para.get_text().strip()

        text = re.sub(r'[^\x00-\x7f]', r'', text)  # remove non-ASCII characters
        words = nltk.tokenize.word_tokenize(text.lower())
        words = [w for w in words if w not in stop_words and "'" not in w]
        word_set.update(words)

    text = ' '.join(word_set)
    if len(text) > 3000:
        text = text[:3000]

    # Strip quotes so they cannot break the SQL string below,
    # then truncate each field to its column width
    company = re.sub(r'[\'\"]', r' ', company)
    location = re.sub(r'[\'\"]', r' ', location)
    title = re.sub(r'[\'\"]', r' ', title)

    if len(company) > 150:
        company = company[:150]

    if len(title) > 150:
        title = title[:150]

    if len(location) > 150:
        location = location[:150]

    sql = '''INSERT INTO kaggle
            (id, Company, Title, Location, Contents, Created, Reviews, Link)
            VALUES ('%d', '%s', '%s', '%s', '%s', '%s', '%d', '%s')''' % \
            (job_id, company, title, location, text,
             submission_date, reviews, job_link)

    try:
        cursor.execute(sql)
        cursor.connection.commit()
    except Exception as e:
        print(job_id, e)
        break

# Fetch a single row as a sanity check
cursor.execute('''SELECT * FROM kaggle''')
data = cursor.fetchone()
print(data)

# Disconnect from the server
cursor.close()
db.close()
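
A side note on the INSERT above: the script strips quotes out of the scraped fields so that the string interpolation cannot break the SQL. A parameterized query, which pymysql supports directly, sidesteps both the escaping and the injection concerns. A minimal sketch, reusing the variables from the loop above:

# Sketch: the same INSERT as a parameterized query. pymysql escapes the
# %s placeholders itself, so quotes in the scraped data are safe.
sql = '''INSERT INTO kaggle
        (id, Company, Title, Location, Contents, Created, Reviews, Link)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s)'''
cursor.execute(sql, (job_id, company, title, location, text,
                     submission_date, reviews, job_link))
cursor.connection.commit()
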
2. Web scraping of careerbuilder
(The full CareerBuilder script is posted in the first reply below.)
For scraping indeed.com, see Jesse's blog: https://jessesw.com/Data-Science-Skills/. The keyword-analysis code is also there.
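
For reference, here is a minimal sketch of that kind of keyword analysis, run against the kaggle table built above. It is a reconstruction, not Jesse's exact code; the skill list is an illustrative assumption, and bigram skills such as "machine learning" are approximated by their first token, since the Contents column stores single-word tokens.

# -*- coding: utf-8 -*-
# Minimal keyword-analysis sketch (a reconstruction, not Jesse's exact
# code). The skill list is an illustrative assumption; bigrams such as
# "machine learning" are approximated by their first token because the
# Contents column stores single-word tokens.
import pymysql

skills = ['python', 'sql', 'machine', 'spark', 'hadoop',
          'java', 'deep', 'cloud', 'tableau']

db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()
cursor.execute('''SELECT Contents FROM kaggle''')
rows = cursor.fetchall()
cursor.close()
db.close()

num_jobs = max(len(rows), 1)  # guard against an empty table
counts = {s: 0 for s in skills}
for (contents,) in rows:
    tokens = set(contents.split())
    for s in skills:
        if s in tokens:
            counts[s] += 1

# Rank skills by the share of postings that mention them
for s, c in sorted(counts.items(), key=lambda kv: -kv[1]):
    print("%-10s %5.1f%%" % (s, 100.0 * c / num_jobs))
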

Main references:
1. "Web Scraping with Python" by Ryan Mitchell
2. "Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving", Ch. 12: Exploring data science jobs with web scraping and text mining
3. https://jessesw.com/Data-Science-Skills/


DL (OP), posted 2017-6-10 23:56:50

Web scraping of careerbuilder:
# -*- coding: utf-8 -*-
import sys
from urllib.request import urlopen
from urllib.error import HTTPError
import re
import datetime
import string
from bs4 import BeautifulSoup
import pymysql
import nltk
from nltk.corpus import stopwords
from time import sleep  # to avoid overwhelming the server

date_today = datetime.date.today()

stop_words = set(stopwords.words('english'))
stop_words.update(set(string.punctuation))

# Open the database connection and get a cursor
db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()

#cursor.execute('''DROP TABLE IF EXISTS careerBuilder''')
cursor.execute('''CREATE TABLE IF NOT EXISTS careerBuilder
    (id VARCHAR(50) NOT NULL, Company VARCHAR(150), Title VARCHAR(150),
    Location VARCHAR(150), Contents VARCHAR(3000),
    Industry VARCHAR(150), Category VARCHAR(150),
    Link VARCHAR(200), Salary VARCHAR(50), Created DATE,
    PRIMARY KEY(id))''')

# Skip job ids that are already in the table
cursor.execute('''SELECT id FROM careerBuilder''')
data = cursor.fetchall()
jobid_visited = set([d[0] for d in data])

ds_url = "https://www.careerbuilder.com/jobs-data-scientist"
sort_by = "date_desc"
page = 1
url = "%s?page_number=%d&sort=%s" % (ds_url, page, sort_by)
try:
    html = urlopen(url)
except HTTPError as e:
    print(e)
    sys.exit()

soup = BeautifulSoup(html.read(), "lxml")

# The page-count span reads like "Page 1 of N"; grab the second number
num_pages = soup.find('span', attrs={'class': 'page-count'}).get_text()
num_pages = int(re.findall(r'\d+', num_pages)[1])

base_url = "https://www.careerbuilder.com"
for page in range(1, num_pages + 1):
    print('start page: ', page)
    url = "%s?page_number=%d&sort=%s" % (ds_url, page, sort_by)
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(page, e)
        break

    soup = BeautifulSoup(html.read(), "lxml")
    jobs = soup.findAll('h2', attrs={'class': 'job-title'})
    for job in jobs:
        # Retrieve the job id
        job_did = job.a.get('data-job-did')
        if job_did in jobid_visited:
            continue

        job_link = base_url + job.a.get('href')
        try:
            job_soup = BeautifulSoup(urlopen(job_link).read(), "lxml")
        except Exception as e:
            print(page, job_did, e)
            continue

        job_detail = job_soup.find('div', attrs={'class': 'card with-padding'})

        title = job_detail.h1.get_text().strip()
        # Some postings omit the company name, so initialize both fields
        company = location = ''
        company_location = job_detail.h2.get_text().strip().split('\n')
        if len(company_location) == 1:
            location = company_location[0]
        elif len(company_location) == 3:
            company, location = company_location[0], company_location[2]

        # Retrieve the posted date ("N days ago" relative to today)
        begin_date = job_detail.h3.get_text().strip()
        begin_date = re.findall(r'(\d+) day', begin_date)
        if len(begin_date) == 0:
            time_delta = 0
        else:
            time_delta = int(begin_date[0])

        begin_date = date_today - datetime.timedelta(days=time_delta)
        begin_date = begin_date.strftime("%y/%m/%d")

        # Retrieve the job industry and category
        snapshot = job_detail.find('div', attrs={'class': 'job-facts item'})

        job_industry = snapshot.find('div', id='job-industry')
        if job_industry:
            job_industry = job_industry.get_text().strip()

        job_category = snapshot.find('div', id='job-categories')
        if job_category:
            job_category = job_category.get_text().strip()

        # Find the annual salary if available
        salary = ''
        for line in snapshot.get_text().splitlines():
            salary_entry = re.findall(r'^(\$.*)/Year$', line)
            if len(salary_entry) == 1:
                salary = salary_entry[0].strip()

        job_id = job_detail.find('div', class_='small-12 columns job-id')
        if job_id:
            job_id = job_id.get_text().strip().split('\n')[1]

        # Get the job description and requirements
        job_item = job_detail.find('div', class_='small-12 columns item')
        text = job_item.get_text()
        # Break into lines
        lines = (line.strip() for line in text.splitlines())
        # Break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines
                  for phrase in line.split(" "))

        text = ' '.join(chunk for chunk in chunks if chunk)
        text = re.sub(r"[\'\-\/&]", r' ', text)
        words = nltk.tokenize.word_tokenize(text.lower())
        words = [w for w in words if w not in stop_words]
        word_set = set(words)

        text = ' '.join(word_set)

        if len(text) > 3000:
            text = text[:3000]

        # Strip quotes so they cannot break the SQL string below,
        # then truncate each field to its column width
        company = re.sub(r'[\'\"]', r' ', company)
        location = re.sub(r'[\'\"]', r' ', location)
        title = re.sub(r'[\'\"]', r' ', title)

        if len(company) > 150:
            company = company[:150]

        if len(title) > 150:
            title = title[:150]

        if len(location) > 150:
            location = location[:150]

#        print(job_did, title, company, location, salary, begin_date,
#              job_industry, job_category, job_link, text)
#        print("------------------------------------------")

        sql = '''INSERT INTO careerBuilder
                (id, Company, Title, Location, Contents, Industry,
                Category, Link, Salary, Created)
                VALUES ('%s', '%s', '%s', '%s', '%s', '%s', '%s',
                '%s', '%s', STR_TO_DATE('%s', '%%y/%%m/%%d'))''' % \
                (job_did, company, title, location, text,
                 job_industry, job_category,
                 job_link, salary, begin_date)
        try:
            cursor.execute(sql)
            cursor.connection.commit()
        except Exception as e:
            print(job_did, e)
            break

    sleep(1)

# Fetch a single row as a sanity check
cursor.execute('''SELECT * FROM careerBuilder''')
data = cursor.fetchone()
print(data)

# Disconnect from the server
cursor.close()
db.close()

DL (OP), posted 2017-6-11 00:03:35

Ranking of key skills and education requirements for jobs on Kaggle:

  • Kaggle's listings are closer to data scientist/engineer roles.
  • Compared with the older data in reference 2, the shares of Python, machine learning, Spark, deep learning, and cloud are rising.
  • PhDs are more sought after.

kaggle_skills.png
kaggle_education.png
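
For anyone reproducing charts like the two above, a minimal plotting sketch. The counts below are hypothetical placeholders; in practice they come from the keyword-analysis sketch in the first post.

# Minimal bar-chart sketch for rankings like the ones above. The counts
# are hypothetical placeholders, not real results; in practice they come
# from the keyword-analysis sketch in the first post.
import matplotlib.pyplot as plt

counts = {'python': 62, 'machine': 48, 'sql': 40, 'spark': 25, 'deep': 20}
skills = sorted(counts, key=counts.get)  # ascending, so the top skill plots highest
plt.barh(range(len(skills)), [counts[s] for s in skills])
plt.yticks(range(len(skills)), skills)
plt.xlabel('% of postings')
plt.title('Key skills in Kaggle job ads')
plt.tight_layout()
plt.savefig('kaggle_skills.png')
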

DL (OP), posted 2017-6-11 00:13:40

Ranking of key skills and education requirements for jobs on CareerBuilder:

careerbuilder_skills.png
careerbuilder_education.png

DL (OP), posted 2017-6-14 21:22:20

Jobs found on CareerBuilder with the keyword "data scientist" (which include some data analyst positions), ranked by industry. A sketch of the aggregation query is below the image.

careerbuilder_industry.png
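
The industry ranking can be pulled straight from the careerBuilder table built above. A minimal sketch of the query (an assumption, not necessarily the exact code behind the plot):

# Minimal sketch: rank industries by number of postings in the
# careerBuilder table (assumed query, not necessarily the exact code
# behind the plot above).
import pymysql

db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()
cursor.execute('''SELECT Industry, COUNT(*) AS n FROM careerBuilder
                  WHERE Industry IS NOT NULL AND Industry <> ''
                  GROUP BY Industry ORDER BY n DESC''')
for industry, n in cursor.fetchall():
    print('%-60s %d' % (industry, n))
cursor.close()
db.close()
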

jessicalmj, posted 2017-6-15 02:02:19

This is pretty cool; each site clearly has a different emphasis. Have you analyzed any other sites, such as Glassdoor or Indeed?

DL (OP), posted 2017-6-23 10:43:49

In reply to jessicalmj (2017-6-15 02:02):

An analysis of Indeed jobs is available at https://jessesw.com/Data-Science-Skills/, so I didn't do it myself. The results should be fairly close to CareerBuilder's.
回复 支持 反对

使用道具 举报
