[DataScience] Web scraping and exploratory analysis of data scientist job ads

DL posted on 2017-6-10 23:36:13

Last edited by DL on 2017-6-10 23:51.

Posting the scraping scripts and analysis results for some job ads. I hope they are useful to everyone, and that they serve as a starting point for further discussion.
1. Web scraping of kaggle jobs
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError
import re
import string
from bs4 import BeautifulSoup
import pymysql
import nltk
from nltk.corpus import stopwords
from time import sleep  # To prevent overwhelming the server

stop_words = set(stopwords.words('english'))
stop_words.update(set(string.punctuation))

# Open database connection
db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
# prepare a cursor object using cursor() method
cursor = db.cursor()

# execute SQL query using execute() method
#cursor.execute('''DROP TABLE IF EXISTS kaggle''')
cursor.execute('''CREATE TABLE IF NOT EXISTS kaggle
    (id INT NOT NULL, Company VARCHAR(150), Title VARCHAR(150),
    Location VARCHAR(150), Contents VARCHAR(3000),
    Created VARCHAR(20), Reviews INT, Link VARCHAR(50),
    PRIMARY KEY(id))''')

# skip job ids that are already in the table
cursor.execute('''SELECT id FROM kaggle''')
data = cursor.fetchall()
jobid_visited = set(d[0] for d in data)

jobId_start = 16800
jobId_end = 17900

home_url = "https://www.kaggle.com/jobs/"

for job_id in range(jobId_start, jobId_end):
    if job_id % 50 == 0:
        print(job_id)
        sleep(1)
    if job_id in jobid_visited:
        continue
    job_link = "%s%s" % (home_url, job_id)
    try:
        html = urlopen(job_link)
    except HTTPError as e:
        print(job_id, e)
        continue
    soup = BeautifulSoup(html.read(), "lxml")
    job_title = soup.find('div', attrs={'class': 'title'})
    title = job_title.h1.getText()
    company = job_title.h2.getText()
    location = job_title.h3.getText()
    submission = soup.find('p', attrs={'class': 'submission-date'})
    submission_date = submission.span['title'].split()[0]
    reviews = submission.contents[2]
    reviews = int(''.join(filter(str.isdigit, reviews)))

    contents = submission.next_siblings  # job description tags
    word_set = set()
    for para in contents:
        if para == '\n':
            continue

        if isinstance(para, str):
            text = para.strip()
        else:
            text = para.get_text().strip()

        text = re.sub(r'[^\x00-\x7f]', r'', text)  # remove non-ASCII
        words = nltk.tokenize.word_tokenize(text.lower())
        words = [w for w in words if (w not in stop_words
                                      and "'" not in w)]
        word_set.update(words)

    text = ' '.join(word_set)
    if len(text) > 3000:
        text = text[:3000]

    # strip quotes so the string-formatted INSERT below stays valid SQL
    company = re.sub(r'[\'\"]', r' ', company)
    location = re.sub(r'[\'\"]', r' ', location)
    title = re.sub(r'[\'\"]', r' ', title)

    # truncate fields to the column widths defined above
    if len(company) > 150:
        company = company[:150]

    if len(title) > 150:
        title = title[:150]

    if len(location) > 150:
        location = location[:150]

    sql = '''INSERT INTO kaggle
            (id, Company, Title, Location, Contents, Created, Reviews, Link)
            VALUES ('%d', '%s', '%s', '%s', '%s', '%s', '%d', '%s')''' %\
            (job_id, company, title, location, text,
             submission_date, reviews, job_link)
    try:
        cursor.execute(sql)
        cursor.connection.commit()
    except Exception as e:
        print(job_id, e)
        break

# Fetch a single row using fetchone() method
cursor.execute('''SELECT * FROM kaggle''')
data = cursor.fetchone()
print(data)

# disconnect from server
cursor.close()
db.close()

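A side note on the INSERT: the string-formatted SQL above only stays valid because quotes are stripped from every field first. A parameterized query, sketched below, lets pymysql do the escaping instead:

# same INSERT, but with pymysql placeholders instead of %-formatting;
# pymysql escapes each value itself, so quotes in titles or
# descriptions no longer need to be stripped beforehand
sql = '''INSERT INTO kaggle
        (id, Company, Title, Location, Contents, Created, Reviews, Link)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s)'''
cursor.execute(sql, (job_id, company, title, location, text,
                     submission_date, reviews, job_link))
db.commit()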
2. Web scraping of careerbuilder

The full careerbuilder script is posted in the first reply below.
The web scraping code for indeed.com is in Jesse's blog: https://jessesw.com/Data-Science-Skills/
The keyword analysis code is also there: https://jessesw.com/Data-Science-Skills/
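For readers who want to run the keyword analysis directly against the kaggle table scraped above, a minimal counting sketch follows. The skill list is my own guess, loosely following Jesse's approach; note that "machine" and "deep" stand in for "machine learning" and "deep learning" because the scraper tokenized descriptions into single words:

# -*- coding: utf-8 -*-
import pymysql

# hypothetical skill list; extend as needed
skills = ['python', 'r', 'sql', 'java', 'spark', 'hadoop',
          'machine', 'deep', 'cloud', 'tableau']

db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()
cursor.execute('''SELECT Contents FROM kaggle''')
ads = [row[0].split() for row in cursor.fetchall()]
db.close()

# Contents holds a deduplicated word set per ad, so a membership
# test counts each ad at most once per skill
counts = {s: sum(s in ad for ad in ads) for s in skills}
for skill, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print('%-10s %5.1f%%' % (skill, 100.0 * n / len(ads)))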
Main references:
1. "Web Scraping with Python" by Ryan Mitchell
2. "Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving", Ch. 12, "Exploring Data Science Jobs with Web Scraping and Text Mining"
3. https://jessesw.com/Data-Science-Skills/

DL (OP) posted on 2017-6-10 23:56:50
web scraping careerbuilder
# -*- coding: utf-8 -*-
import sys
from urllib.request import urlopen
from urllib.error import HTTPError
import re
import datetime
import string
from bs4 import BeautifulSoup
import pymysql
import nltk
from nltk.corpus import stopwords
from time import sleep  # To prevent overwhelming the server

date_today = datetime.date.today()

stop_words = set(stopwords.words('english'))
stop_words.update(set(string.punctuation))

# Open database connection
db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
# prepare a cursor object using cursor() method
cursor = db.cursor()

# execute SQL query using execute() method
#cursor.execute('''DROP TABLE IF EXISTS careerBuilder''')
cursor.execute('''CREATE TABLE IF NOT EXISTS careerBuilder
    (id VARCHAR(50) NOT NULL, Company VARCHAR(150), Title VARCHAR(150),
    Location VARCHAR(150), Contents VARCHAR(3000),
    Industry VARCHAR(150), Category VARCHAR(150),
    Link VARCHAR(200), Salary VARCHAR(50), Created DATE,
    PRIMARY KEY(id))''')

cursor.execute('''SELECT id FROM careerBuilder''')
data = cursor.fetchall()
jobid_visited = set(d[0] for d in data)

ds_url = "https://www.careerbuilder.com/jobs-data-scientist"
sort_by = "date_desc"
page = 1
url = "%s?page_number=%d&sort=%s" % (ds_url, page, sort_by)
try:
    html = urlopen(url)
except HTTPError as e:
    print(e)
    sys.exit()

soup = BeautifulSoup(html.read(), "lxml")

# read the total page count from the first results page
num_pages = soup.find('span', attrs={'class': 'page-count'}).get_text()
num_pages = int(re.findall(r'\d+', num_pages)[1])

base_url = "https://www.careerbuilder.com"
for page in range(1, num_pages + 1):
    print('start page: ', page)
    url = "%s?page_number=%d&sort=%s" % (ds_url, page, sort_by)
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(page, e)
        break

    soup = BeautifulSoup(html.read(), "lxml")
    jobs = soup.findAll('h2', attrs={'class': 'job-title'})
    for job in jobs:
        # retrieve job id
        job_did = job.a.get('data-job-did')
        if job_did in jobid_visited:
            continue

        job_link = base_url + job.a.get('href')
        try:
            job_soup = BeautifulSoup(urlopen(job_link).read(), "lxml")
        except Exception as e:
            print(page, job_did, e)
            continue

        job_detail = job_soup.find('div', attrs={'class': 'card with-padding'})
        title = job_detail.h1.get_text().strip()
        company = ''  # default in case the header carries no company name
        company_location = job_detail.h2.get_text().strip().split('\n')
        if len(company_location) == 1:
            location = company_location[0]
        elif len(company_location) == 3:
            company, location = company_location[0], company_location[2]

        # retrieve job posted date ("N days ago" -> actual date)
        begin_date = job_detail.h3.get_text().strip()
        begin_date = re.findall(r'(\d+) day', begin_date)
        if len(begin_date) == 0:
            time_delta = 0
        else:
            time_delta = int(begin_date[0])

        begin_date = date_today - datetime.timedelta(days=time_delta)
        begin_date = begin_date.strftime("%y/%m/%d")

        # retrieve job industry and category
        snapshot = job_detail.find('div', attrs={'class': 'job-facts item'})

        job_industry = snapshot.find('div', id='job-industry')
        if job_industry:
            job_industry = job_industry.get_text().strip()

        job_category = snapshot.find('div', id='job-categories')
        if job_category:
            job_category = job_category.get_text().strip()

        # find annual salary if available
        salary = ''
        for line in snapshot.get_text().splitlines():
            salary_entry = re.findall(r'^(\$.*)/Year', line)
            if len(salary_entry) == 1:
                salary = salary_entry[0].strip()

        job_id = job_detail.find('div', class_='small-12 columns job-id')
        if job_id:
            job_id = job_id.get_text().strip().split('\n')[1]

        # get job description and requirements
        job_item = job_detail.find('div', class_='small-12 columns item')
        text = job_item.get_text()
        # break into lines
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines
                  for phrase in line.split("  "))

        text = ' '.join(chunk for chunk in chunks if chunk)
        text = re.sub(r"[\'\-\/&]", r' ', text)
        words = nltk.tokenize.word_tokenize(text.lower())
        words = [w for w in words if w not in stop_words]
        word_set = set(words)

        text = ' '.join(word_set)

        if len(text) > 3000:
            text = text[:3000]

        company = re.sub(r'[\'\"]', r' ', company)
        location = re.sub(r'[\'\"]', r' ', location)
        title = re.sub(r'[\'\"]', r' ', title)

        if len(company) > 150:
            company = company[:150]

        if len(title) > 150:
            title = title[:150]

        if len(location) > 150:
            location = location[:150]

#        print(job_did, title, company, location, salary, begin_date,
#              job_industry, job_category, job_link, text)
#        print("------------------------------------------")

        sql = '''INSERT INTO careerBuilder
                (id, Company, Title, Location, Contents, Industry,
                Category, Link, Salary, Created)
                VALUES ('%s', '%s', '%s', '%s', '%s', '%s', '%s',
                '%s', '%s', STR_TO_DATE('%s', '%%y/%%m/%%d'))''' %\
                (job_did, company, title, location, text,
                 job_industry, job_category,
                 job_link, salary, begin_date)
        try:
            cursor.execute(sql)
            cursor.connection.commit()
        except Exception as e:
            print(job_did, e)
            break

    sleep(1)

# Fetch a single row using fetchone() method
cursor.execute('''SELECT * FROM careerBuilder''')
data = cursor.fetchone()
print(data)

# disconnect from server
cursor.close()
db.close()


DL (OP) posted on 2017-6-11 00:03:35

Last edited by DL on 2017-6-11 00:35.

Skill ranking and education requirements for the jobs on kaggle

  • The jobs listed on kaggle are closer to data scientist/engineer roles.
  • Compared with the older data in the reference book, the shares of Python, machine learning, Spark, deep learning, and cloud are rising.
  • PhDs appear to be more in demand (a counting sketch follows below).
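A minimal sketch of how the education counts behind the second chart can be pulled from the table; the degree keyword variants are my own assumptions (Contents was lower-cased and tokenized during scraping, so exact surviving forms may differ):

# -*- coding: utf-8 -*-
import pymysql

db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()
cursor.execute('''SELECT Contents FROM kaggle''')
ads = [set(row[0].split()) for row in cursor.fetchall()]
db.close()

# assumed degree keywords; adjust to match the tokenized vocabulary
degrees = {'PhD': {'phd', 'ph.d.', 'doctorate'},
           'Master': {'master', 'masters', 'ms', 'm.s.'},
           'Bachelor': {'bachelor', 'bachelors', 'bs', 'b.s.'}}
for name, keywords in degrees.items():
    # count an ad once if any keyword variant appears in its word set
    n = sum(bool(keywords & ad) for ad in ads)
    print('%-8s %d of %d ads' % (name, n, len(ads)))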
kaggle_skills.png
kaggle_education.png

DL (OP) posted on 2017-6-11 00:13:40
Skill ranking and education requirements for the jobs on career builder


careerbuilder_skills.png
careerbuilder_education.png

DL (OP) posted on 2017-6-14 21:22:20
Jobs found on career builder by searching the keyword "data scientist" (these include some data analyst positions), ranked by industry. The query behind the chart is sketched below it.

careerbuilder_industry.png
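The industry ranking itself is a single GROUP BY over the careerBuilder table; a minimal sketch:

# -*- coding: utf-8 -*-
import pymysql

db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()
# count postings per industry, most common first
# (rows with no industry show up under None)
cursor.execute('''SELECT Industry, COUNT(*) AS n FROM careerBuilder
                  GROUP BY Industry ORDER BY n DESC''')
for industry, n in cursor.fetchall():
    print('%-50s %d' % (industry, n))
cursor.close()
db.close()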

jessicalmj posted on 2017-6-15 02:02:19
This is pretty cool; it looks like each site has a different focus. Have you analyzed any other sites, like Glassdoor or indeed?

DL (OP) posted on 2017-6-23 10:43:49

Quoting jessicalmj (2017-6-15 02:02):
This is pretty cool; it looks like each site has a different focus. Have you analyzed any other sites, like Glassdoor or indeed?

An analysis of indeed jobs is available at https://jessesw.com/Data-Science-Skills/. I haven't done it myself; the results should be fairly close to CareerBuilder's.
