[DataScience] Web scraping and exploratory analysis of data scientist job ads

DL posted on 2017-6-10 23:36:13

Last edited by DL on 2017-6-10 23:51.

Posting the scraping code and analysis results for some job ads. Hope it's useful, and that it encourages others to share theirs.
1. Web scraping of kaggle jobs
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError
import re
import string
from bs4 import BeautifulSoup
import pymysql
import nltk
from nltk.corpus import stopwords
from time import sleep  # To prevent overwhelming the server

stop_words = set(stopwords.words('english'))
stop_words.update(set(string.punctuation))

# Open database connection
db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
# prepare a cursor object using cursor() method
cursor = db.cursor()

# execute SQL query using execute() method
#cursor.execute('''DROP TABLE IF EXISTS kaggle''')
cursor.execute('''CREATE TABLE IF NOT EXISTS kaggle
    (id INT NOT NULL, Company VARCHAR(150), Title VARCHAR(150),
    Location VARCHAR(150), Contents VARCHAR(3000),
    Created VARCHAR(20), Reviews INT, Link VARCHAR(50),
    PRIMARY KEY(id))''')

cursor.execute('''SELECT id FROM kaggle''')
data = cursor.fetchall()
jobid_visited = set([d[0] for d in data])

jobId_start = 16800
jobId_end = 17900

home_url = "https://www.kaggle.com/jobs/"

for job_id in range(jobId_start, jobId_end):
    if job_id % 50 == 0:
        print(job_id)
        sleep(1)
    if job_id in jobid_visited:
        continue
    job_link = "%s%s" % (home_url, job_id)
    try:
        html = urlopen(job_link)
    except HTTPError as e:
        print(job_id, e)
        continue
    soup = BeautifulSoup(html.read(), "lxml")
    job_title = soup.find('div', attrs={'class': 'title'})
    title = job_title.h1.getText()
    company = job_title.h2.getText()
    location = job_title.h3.getText()
    submission = soup.find('p', attrs={'class': 'submission-date'})
    submission_date = submission.span['title'].split()[0]
    reviews = submission.contents[2]
    reviews = int(''.join(list(filter(str.isdigit, reviews))))

    contents = submission.next_siblings  # job descriptions tag
    word_set = set()
    for para in contents:
        if para == '\n':
            continue

        if isinstance(para, str):
            text = para.strip()
        else:
            text = para.get_text().strip()

        text = re.sub(r'[^\x00-\x7f]', r'', text)  # remove non ASCII
        words = nltk.tokenize.word_tokenize(text.lower())
        words = [w for w in words if ((not w in stop_words)
                                      and (not "'" in w))]
        word_set.update(words)

    text = ' '.join(word_set)
    if len(text) > 3000:
        text = text[:3000]

    company = re.sub(r'[\'\"]', r' ', company)
    location = re.sub(r'[\'\"]', r' ', location)
    title = re.sub(r'[\'\"]', r' ', title)

    if len(company) > 150:
        company = company[:150]

    if len(title) > 150:
        title = title[:150]

    if len(location) > 150:
        location = location[:150]

    sql = '''INSERT INTO kaggle
            (id, Company, Title, Location, Contents, Created, Reviews, Link)
            VALUES ('%d', '%s', '%s', '%s', '%s', '%s', '%d', '%s')''' %\
            (job_id, company, title, location, text,
             submission_date, reviews, job_link)
    try:
        cursor.execute(sql)
        cursor.connection.commit()
    except Exception as e:
        print(job_id, e)
        break

# Fetch a single row using fetchone() method
cursor.execute('''SELECT * FROM kaggle''')
data = cursor.fetchone()
print(data)

# disconnect from server
cursor.close()
db.close()
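One note on the INSERT above: building SQL by % string formatting is the reason quotes have to be stripped from every field first, and it is fragile in general. pymysql also supports parameterized queries, which escape values automatically. A minimal sketch of the same INSERT in that style, using the same table and variables as above:

# Same INSERT, but letting pymysql escape the values itself.
# %s is the placeholder for every column, including the INT ones.
sql = '''INSERT INTO kaggle
        (id, Company, Title, Location, Contents, Created, Reviews, Link)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s)'''
cursor.execute(sql, (job_id, company, title, location, text,
                     submission_date, reviews, job_link))
db.commit()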
2. Web scraping of careerbuilder

(The full careerbuilder script is posted in the first reply below.)
For scraping indeed.com, see Jesse's blog: https://jessesw.com/Data-Science-Skills/. The keyword-analysis code is there as well; a minimal version of that count is sketched below.
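For reference, the keyword analysis amounts to counting how many stored ads mention each skill term. A minimal sketch against the kaggle table built above; the skill list here is illustrative, not the exact one behind the charts:

# Count how many kaggle job ads mention each skill keyword.
# Contents was stored as a space-joined set of tokens, so splitting
# it back into a set gives one membership test per skill.
import pymysql

skills = ['python', 'sql', 'spark', 'hadoop', 'java', 'scala',
          'tableau', 'aws', 'tensorflow']

db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()
cursor.execute('SELECT Contents FROM kaggle')

counts = dict.fromkeys(skills, 0)
for (contents,) in cursor.fetchall():
    tokens = set(contents.split())
    for skill in skills:
        if skill in tokens:
            counts[skill] += 1

for skill, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(skill, n)

cursor.close()
db.close()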
Main references:
1. "Web Scraping with Python" by Ryan Mitchell
2. "Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving", Ch. 12: Exploring Data Science Jobs with Web Scraping and Text Mining
3. https://jessesw.com/Data-Science-Skills/


DL (OP) posted on 2017-6-10 23:56:50
web scraping careerbuilder
# -*- coding: utf-8 -*-
import sys
from urllib.request import urlopen
from urllib.error import HTTPError
import re
import datetime
import string
from bs4 import BeautifulSoup
import pymysql
import nltk
from nltk.corpus import stopwords
from time import sleep  # To prevent overwhelming the server

date_today = datetime.date.today()

stop_words = set(stopwords.words('english'))
stop_words.update(set(string.punctuation))

# Open database connection
db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
# prepare a cursor object using cursor() method
cursor = db.cursor()

# execute SQL query using execute() method
#cursor.execute('''DROP TABLE IF EXISTS careerBuilder''')
cursor.execute('''CREATE TABLE IF NOT EXISTS careerBuilder
    (id VARCHAR(50) NOT NULL, Company VARCHAR(150), Title VARCHAR(150),
    Location VARCHAR(150), Contents VARCHAR(3000),
    Industry VARCHAR(150), Category VARCHAR(150),
    Link VARCHAR(200), Salary VARCHAR(50), Created DATE,
    PRIMARY KEY(id))''')

cursor.execute('''SELECT id FROM careerBuilder''')
data = cursor.fetchall()
jobid_visited = set([d[0] for d in data])

ds_url = "https://www.careerbuilder.com/jobs-data-scientist"
sort_by = "date_desc"
page = 1
url = "%s?page_number=%d&sort=%s" % (ds_url, page, sort_by)
try:
    html = urlopen(url)
except HTTPError as e:
    print(e)
    sys.exit()

soup = BeautifulSoup(html.read(), "lxml")

num_pages = soup.find('span', attrs={'class': 'page-count'}).get_text()
num_pages = int(re.findall(r'\d+', num_pages)[1])

base_url = "https://www.careerbuilder.com"
for page in range(1, num_pages + 1):
    print('start page: ', page)
    url = "%s?page_number=%d&sort=%s" % (ds_url, page, sort_by)
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(page, e)
        break

    soup = BeautifulSoup(html.read(), "lxml")
    jobs = soup.find_all('h2', attrs={'class': 'job-title'})
    for job in jobs:
        # retrieve job id
        job_did = job.a.get('data-job-did')
        if job_did in jobid_visited:
            continue

        job_link = base_url + job.a.get('href')
        try:
            job_soup = BeautifulSoup(urlopen(job_link).read(), "lxml")
        except Exception as e:
            print(page, job_did, e)
            continue

        job_detail = job_soup.find('div', attrs={'class': 'card with-padding'})
        title = job_detail.h1.get_text().strip()
        company = ''  # some postings omit the company name
        company_location = job_detail.h2.get_text().strip().split('\n')
        if len(company_location) == 1:
            location = company_location[0]
        elif len(company_location) == 3:
            company, location = company_location[0], company_location[2]

        # retrieve job posted date
        begin_date = job_detail.h3.get_text().strip()
        begin_date = re.findall(r'(\d+) day', begin_date)
        if len(begin_date) == 0:
            time_delta = 0
        else:
            time_delta = int(begin_date[0])

        begin_date = date_today - datetime.timedelta(days=time_delta)
        begin_date = begin_date.strftime("%y/%m/%d")

        # retrieve job category
        snapshot = job_detail.find('div', attrs={'class': 'job-facts item'})

        job_industry = snapshot.find('div', id='job-industry')
        if job_industry:
            job_industry = job_industry.get_text().strip()

        job_category = snapshot.find('div', id='job-categories')
        if job_category:
            job_category = job_category.get_text().strip()

        # find annual salary if available
        salary = ''
        for line in snapshot.get_text().splitlines():
            salary_entry = re.findall(r'^(\$.*)/Year$', line)
            if len(salary_entry) == 1:
                salary = salary_entry[0].strip()

        job_id = job_detail.find('div', class_='small-12 columns job-id')
        if job_id:
            job_id = job_id.get_text().strip().split('\n')[1]

        # get job description and requirements
        job_item = job_detail.find('div', class_='small-12 columns item')
        text = job_item.get_text()
        # break into lines
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines
                  for phrase in line.split("  "))

        text = ' '.join(chunk for chunk in chunks if chunk)
        text = re.sub(r"[\'\-\/&]", r' ', text)
        words = nltk.tokenize.word_tokenize(text.lower())
        words = [w for w in words if (not w in stop_words)]
        word_set = set(words)

        text = ' '.join(word_set)

        if len(text) > 3000:
            text = text[:3000]

        company = re.sub(r'[\'\"]', r' ', company)
        location = re.sub(r'[\'\"]', r' ', location)
        title = re.sub(r'[\'\"]', r' ', title)

        if len(company) > 150:
            company = company[:150]

        if len(title) > 150:
            title = title[:150]

        if len(location) > 150:
            location = location[:150]

#        print(job_did, title, company, location, salary, begin_date,
#              job_industry, job_category, job_link, text)
#        print("------------------------------------------")

        sql = '''INSERT INTO careerBuilder
                (id, Company, Title, Location, Contents, Industry,
                Category, Link, Salary, Created)
                VALUES ('%s', '%s', '%s', '%s', '%s', '%s', '%s',
                '%s', '%s', STR_TO_DATE('%s', '%%y/%%m/%%d'))''' %\
                (job_did, company, title, location, text,
                 job_industry, job_category,
                 job_link, salary, begin_date)
        try:
            cursor.execute(sql)
            cursor.connection.commit()
        except Exception as e:
            print(job_did, e)
            break

    sleep(1)

# Fetch a single row using fetchone() method
cursor.execute('''SELECT * FROM careerBuilder''')
data = cursor.fetchone()
print(data)

# disconnect from server
cursor.close()
db.close()

DL (OP) posted on 2017-6-11 00:03:35
Last edited by DL on 2017-6-11 00:35.

Ranking of key skills and education requirements in the kaggle job ads:

  • The jobs listed on kaggle are closer to data scientist/engineer roles.
  • Compared with the older data in the book, the shares of Python, machine learning, Spark, deep learning, and cloud are rising.
  • PhDs are more sought after.

(The education count is sketched after the figures below.)
kaggle_skills.png
kaggle_education.png
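The education chart comes from the same kind of count as the skill ranking, just over degree terms instead of skills. A rough sketch, with the degree keywords being my guess at the categories in the figure:

# Count degree mentions across the stored kaggle ads.
import pymysql

degrees = ['phd', 'master', 'ms', 'bachelor', 'bs', 'mba']

db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()
cursor.execute('SELECT Contents FROM kaggle')

counts = dict.fromkeys(degrees, 0)
for (contents,) in cursor.fetchall():
    tokens = set(contents.split())
    for deg in degrees:
        if deg in tokens:
            counts[deg] += 1

print(sorted(counts.items(), key=lambda kv: -kv[1]))

cursor.close()
db.close()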

DL (OP) posted on 2017-6-11 00:13:40
Ranking of key skills and education requirements in the careerbuilder job ads:
careerbuilder_skills.png
careerbuilder_education.png

DL (OP) posted on 2017-6-14 21:22:20
Jobs found on careerbuilder with the keyword search "data scientist" (which pulls in some data analyst positions too), ranked by industry; the counting query is sketched below the figure.

careerbuilder_industry.png
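The industry ranking itself is just an aggregate over the careerBuilder table built above; a sketch:

# Rank industries by the number of scraped careerBuilder ads.
import pymysql

db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()
cursor.execute('''SELECT Industry, COUNT(*) AS n FROM careerBuilder
                  GROUP BY Industry ORDER BY n DESC''')
for industry, n in cursor.fetchall():
    print(industry, n)

cursor.close()
db.close()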

jessicalmj posted on 2017-6-15 02:02:19
This is pretty cool; each site clearly has a different emphasis. Have you analyzed other sites, like Glassdoor or Indeed?

DL (OP) posted on 2017-6-23 10:43:49, replying to jessicalmj:

An analysis of Indeed jobs is available at https://jessesw.com/Data-Science-Skills/. I didn't run one myself; the results should be fairly close to CareerBuilder's.
