[DataScience] Web scraping and exploratory analysis of data scientist job ads

DL posted on 2017-6-10 23:36:13

Here are some scripts for scraping job ads, along with the analysis results. I hope they are useful to everyone, and that they spark further discussion.
1. Web scraping of kaggle jobs

# -*- coding: utf-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError
import re
import string
from bs4 import BeautifulSoup
import pymysql
import nltk
from nltk.corpus import stopwords
from time import sleep  # to prevent overwhelming the server

stop_words = set(stopwords.words('english'))
stop_words.update(set(string.punctuation))

# Open database connection
db = pymysql.connect(host="localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
# prepare a cursor object using cursor() method
cursor = db.cursor()

# execute SQL query using execute() method
#cursor.execute('''DROP TABLE IF EXISTS kaggle''')
cursor.execute('''CREATE TABLE IF NOT EXISTS kaggle
    (id INT NOT NULL, Company VARCHAR(150), Title VARCHAR(150),
    Location VARCHAR(150), Contents VARCHAR(3000),
    Created VARCHAR(20), Reviews INT, Link VARCHAR(50),
    PRIMARY KEY(id))''')

cursor.execute('''SELECT id FROM kaggle''')
data = cursor.fetchall()
jobid_visited = set([d[0] for d in data])

jobId_start = 16800
jobId_end = 17900

home_url = "https://www.kaggle.com/jobs/"

for job_id in range(jobId_start, jobId_end):
    if job_id % 50 == 0:
        print(job_id)
        sleep(1)
    if job_id in jobid_visited:
        continue
    job_link = "%s%s" % (home_url, job_id)
    try:
        html = urlopen(job_link)
    except HTTPError as e:
        print(job_id, e)
        continue
    soup = BeautifulSoup(html.read(), "lxml")
    job_title = soup.find('div', attrs={'class': 'title'})
    title = job_title.h1.getText()
    company = job_title.h2.getText()
    location = job_title.h3.getText()
    submission = soup.find('p', attrs={'class': 'submission-date'})
    submission_date = submission.span['title'].split()[0]
    reviews = submission.contents[2]
    reviews = int(''.join(list(filter(str.isdigit, reviews))))

    contents = submission.next_siblings  # job description tags
    word_set = set()
    for para in contents:
        if para == '\n':
            continue

        if isinstance(para, str):
            text = para.strip()
        else:
            text = para.get_text().strip()

        text = re.sub(r'[^\x00-\x7f]', r'', text)  # remove non-ASCII characters
        words = nltk.tokenize.word_tokenize(text.lower())
        words = [w for w in words if ((w not in stop_words)
                                      and ("'" not in w))]
        word_set.update(words)

    text = ' '.join(word_set)
    if len(text) > 3000:
        text = text[:3000]

    company = re.sub(r'[\'\"]', r' ', company)
    location = re.sub(r'[\'\"]', r' ', location)
    title = re.sub(r'[\'\"]', r' ', title)

    if len(company) > 150:
        company = company[:150]

    if len(title) > 150:
        title = title[:150]

    if len(location) > 150:
        location = location[:150]

    sql = '''INSERT INTO kaggle
            (id, Company, Title, Location, Contents, Created, Reviews, Link)
            VALUES ('%d', '%s', '%s', '%s', '%s', '%s', '%d', '%s')''' % \
            (job_id, company, title, location, text,
             submission_date, reviews, job_link)
    try:
        cursor.execute(sql)
        cursor.connection.commit()
    except Exception as e:
        print(job_id, e)
        break

# Fetch a single row using fetchone() method
cursor.execute('''SELECT * FROM kaggle''')
data = cursor.fetchone()
print(data)

# disconnect from server
cursor.close()
db.close()
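
A note on the INSERT: the statement is built with Python string formatting, which is why the script strips quote characters from the scraped fields beforehand. A safer variant, sketched below for the same table layout, lets pymysql bind the values itself:

# Sketch of the same INSERT using pymysql's parameter binding
# (pymysql uses %s placeholders). Binding escapes quotes in the
# scraped text, so the re.sub() quote-stripping above becomes optional.
sql = '''INSERT INTO kaggle
        (id, Company, Title, Location, Contents, Created, Reviews, Link)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s)'''
cursor.execute(sql, (job_id, company, title, location, text,
                     submission_date, reviews, job_link))
db.commit()
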
2. Web scraping of careerbuilder

The full script is posted in the first reply below.
The web scraping code for Indeed.com can be found in Jesse's blog: https://jessesw.com/Data-Science-Skills/
The keyword analysis code is there as well.
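
For readers who want a starting point without leaving this thread: the core of the keyword analysis is just counting how many postings mention each term. Below is a minimal sketch against the kaggle table built above. The skill list is an illustrative subset, and multi-word skills (e.g. "machine learning") would need extra handling, since Contents stores single tokens.

# Sketch: rank skill keywords by the number of postings that mention them.
# Assumes the `jobs` database and `kaggle` table created by the script above.
import pymysql

skills = ['python', 'r', 'sql', 'spark', 'hadoop', 'java',
          'tensorflow', 'scala', 'sas', 'tableau']  # illustrative subset

db = pymysql.connect(host="localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()
cursor.execute('SELECT Contents FROM kaggle')
# Contents holds each posting's de-duplicated tokens joined by spaces,
# so a set per posting gives cheap membership tests.
postings = [set(row[0].split()) for row in cursor.fetchall()]
db.close()

counts = {s: sum(s in p for p in postings) for s in skills}
for skill, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print('%-12s %d' % (skill, n))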

Main references:
1. "Web Scraping with Python" by Ryan Mitchell
2. "Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving", Ch. 12: Exploring Data Science Jobs with Web Scraping and Text Mining
3. https://jessesw.com/Data-Science-Skills/

DL (OP) posted on 2017-6-10 23:56:50

Web scraping of careerbuilder:

# -*- coding: utf-8 -*-
import sys
from urllib.request import urlopen
from urllib.error import HTTPError
import re
import datetime
import string
from bs4 import BeautifulSoup
import pymysql
import nltk
from nltk.corpus import stopwords
from time import sleep  # to prevent overwhelming the server

date_today = datetime.date.today()

stop_words = set(stopwords.words('english'))
stop_words.update(set(string.punctuation))

# Open database connection
db = pymysql.connect(host="localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
# prepare a cursor object using cursor() method
cursor = db.cursor()

# execute SQL query using execute() method
#cursor.execute('''DROP TABLE IF EXISTS careerBuilder''')
cursor.execute('''CREATE TABLE IF NOT EXISTS careerBuilder
    (id VARCHAR(50) NOT NULL, Company VARCHAR(150), Title VARCHAR(150),
    Location VARCHAR(150), Contents VARCHAR(3000),
    Industry VARCHAR(150), Category VARCHAR(150),
    Link VARCHAR(200), Salary VARCHAR(50), Created DATE,
    PRIMARY KEY(id))''')

cursor.execute('''SELECT id FROM careerBuilder''')
data = cursor.fetchall()
jobid_visited = set([d[0] for d in data])

ds_url = "https://www.careerbuilder.com/jobs-data-scientist"
sort_by = "date_desc"
page = 1
url = "%s?page_number=%d&sort=%s" % (ds_url, page, sort_by)
try:
    html = urlopen(url)
except HTTPError as e:
    print(e)
    sys.exit()

soup = BeautifulSoup(html.read(), "lxml")

num_pages = soup.find('span', attrs={'class': 'page-count'}).get_text()
num_pages = int(re.findall(r'\d+', num_pages)[1])

base_url = "https://www.careerbuilder.com"
for page in range(1, num_pages + 1):
    print('start page: ', page)
    url = "%s?page_number=%d&sort=%s" % (ds_url, page, sort_by)
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(page, e)
        break

    soup = BeautifulSoup(html.read(), "lxml")
    jobs = soup.findAll('h2', attrs={'class': 'job-title'})
    for job in jobs:
        # retrieve job id
        job_did = job.a.get('data-job-did')
        if job_did in jobid_visited:
            continue

        job_link = base_url + job.a.get('href')
        try:
            job_soup = BeautifulSoup(urlopen(job_link).read(), "lxml")
        except Exception as e:
            print(page, job_did, e)
            continue

        job_detail = job_soup.find('div', attrs={'class': 'card with-padding'})
        title = job_detail.h1.get_text().strip()
        company_location = job_detail.h2.get_text().strip().split('\n')
        company = ''  # default: some postings list only a location
        if len(company_location) == 1:
            location = company_location[0]
        elif len(company_location) == 3:
            company, location = company_location[0], company_location[2]

        # retrieve job posted date
        begin_date = job_detail.h3.get_text().strip()
        begin_date = re.findall(r'(\d+) day', begin_date)
        if len(begin_date) == 0:
            time_delta = 0
        else:
            time_delta = int(begin_date[0])

        begin_date = date_today - datetime.timedelta(days=time_delta)
        begin_date = begin_date.strftime("%y/%m/%d")

        # retrieve job category
        snapshot = job_detail.find('div', attrs={'class': 'job-facts item'})

        job_industry = snapshot.find('div', id='job-industry')
        if job_industry:
            job_industry = job_industry.get_text().strip()

        job_category = snapshot.find('div', id='job-categories')
        if job_category:
            job_category = job_category.get_text().strip()

        # find annual salary if available
        salary = ''
        for line in snapshot.get_text().splitlines():
            salary_entry = re.findall(r'^(\$.*)/Year', line)
            if len(salary_entry) == 1:
                salary = salary_entry[0].strip()

        job_id = job_detail.find('div', class_='small-12 columns job-id')
        if job_id:
            job_id = job_id.get_text().strip().split('\n')[1]

        # get job description and requirements
        job_item = job_detail.find('div', class_='small-12 columns item')
        text = job_item.get_text()
        # break into lines
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines
                  for phrase in line.split(" "))

        text = ' '.join(chunk for chunk in chunks if chunk)
        text = re.sub(r"[\'\-\/&]", r' ', text)
        words = nltk.tokenize.word_tokenize(text.lower())
        words = [w for w in words if (w not in stop_words)]
        word_set = set(words)

        text = ' '.join(word_set)

        if len(text) > 3000:
            text = text[:3000]

        company = re.sub(r'[\'\"]', r' ', company)
        location = re.sub(r'[\'\"]', r' ', location)
        title = re.sub(r'[\'\"]', r' ', title)

        if len(company) > 150:
            company = company[:150]

        if len(title) > 150:
            title = title[:150]

        if len(location) > 150:
            location = location[:150]

#        print(job_did, title, company, location, salary, begin_date,
#              job_industry, job_category, job_link, text)
#        print("------------------------------------------")

        sql = '''INSERT INTO careerBuilder
                (id, Company, Title, Location, Contents, Industry,
                Category, Link, Salary, Created)
                VALUES ('%s', '%s', '%s', '%s', '%s', '%s', '%s',
                '%s', '%s', STR_TO_DATE('%s', '%%y/%%m/%%d'))''' % \
                (job_did, company, title, location, text,
                 job_industry, job_category,
                 job_link, salary, begin_date)
        try:
            cursor.execute(sql)
            cursor.connection.commit()
        except Exception as e:
            print(job_id, e)
            break

    sleep(1)

# Fetch a single row using fetchone() method
cursor.execute('''SELECT * FROM careerBuilder''')
data = cursor.fetchone()
print(data)

# disconnect from server
cursor.close()
db.close()

DL (OP) posted on 2017-6-11 00:03:35

Ranking of key skills and education requirements for Kaggle jobs:

  • The jobs listed on Kaggle are closer to data scientist/engineer roles.
  • Compared with the older data in reference 2, the shares of Python, machine learning, Spark, deep learning, and cloud are rising.
  • PhDs are more sought after.

kaggle_skills.png
kaggle_education.png
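
For completeness, the charts above are plain bar plots of those counts. A minimal matplotlib sketch, assuming a counts dict like the one in the keyword-counting sketch earlier in the thread:

# Sketch: horizontal bar chart of skill counts, in the spirit of kaggle_skills.png.
# `counts` is assumed to be a {skill: number_of_postings} dict.
import matplotlib.pyplot as plt

items = sorted(counts.items(), key=lambda kv: kv[1])
plt.barh(range(len(items)), [n for _, n in items])
plt.yticks(range(len(items)), [s for s, _ in items])
plt.xlabel('Number of postings')
plt.title('Skill keywords in Kaggle job ads')
plt.tight_layout()
plt.show()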

DL (OP) posted on 2017-6-11 00:13:40

Ranking of key skills and education requirements for CareerBuilder jobs:

careerbuilder_skills.png
careerbuilder_education.png

DL (OP) posted on 2017-6-14 21:22:20

Jobs found on CareerBuilder under the search keyword "data scientist" (these include some data analyst positions), ranked by industry:

careerbuilder_industry.png
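
The industry ranking itself comes straight out of the careerBuilder table with a GROUP BY; a minimal sketch:

# Sketch: rank industries by posting count, in the spirit of careerbuilder_industry.png.
import pymysql

db = pymysql.connect(host="localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()
cursor.execute('''SELECT Industry, COUNT(*) AS n FROM careerBuilder
                  GROUP BY Industry ORDER BY n DESC''')
for industry, n in cursor.fetchall():
    print('%-40s %d' % (industry, n))
db.close()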

jessicalmj posted on 2017-6-15 02:02:19

This is pretty cool; it looks like each site has a different emphasis. Have you analyzed other sites, like Glassdoor or Indeed?

DL (OP) posted 6 days ago

In reply to jessicalmj (2017-6-15 02:02): "This is pretty cool; it looks like each site has a different emphasis. Have you analyzed other sites, like Glassdoor or Indeed?"

There is an analysis of Indeed jobs at https://jessesw.com/Data-Science-Skills/; I didn't do one myself. The results should be fairly close to CareerBuilder's.