[DataScience] Web scraping and exploratory analysis of data scientist job ads

DL posted on 2017-6-10 23:36:13

注册一亩三分地论坛,查看更多干货!

您需要 登录 才可以下载或查看,没有帐号?获取更多干活,快来注册

x
Last edited by DL on 2017-6-10 23:51.
Here are some web-scraping scripts for job ads and the resulting analysis. I hope they are useful and serve as a starting point for others to build on.
1. Web scraping of kaggle jobs
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError
import re
import string
from bs4 import BeautifulSoup
import pymysql
import nltk
from nltk.corpus import stopwords
from time import sleep  # to avoid overwhelming the server

stop_words = set(stopwords.words('english'))
stop_words.update(set(string.punctuation))

# Open database connection
db = pymysql.connect(host="localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
# Prepare a cursor object using the cursor() method
cursor = db.cursor()

#cursor.execute('''DROP TABLE IF EXISTS kaggle''')
cursor.execute('''CREATE TABLE IF NOT EXISTS kaggle
    (id INT NOT NULL, Company VARCHAR(150), Title VARCHAR(150),
    Location VARCHAR(150), Contents VARCHAR(3000),
    Created VARCHAR(20), Reviews INT, Link VARCHAR(50),
    PRIMARY KEY(id))''')

# Skip job ids that are already in the table
cursor.execute('''SELECT id FROM kaggle''')
data = cursor.fetchall()
jobid_visited = set([d[0] for d in data])

jobId_start = 16800
jobId_end = 17900

home_url = "https://www.kaggle.com/jobs/"

for job_id in range(jobId_start, jobId_end):
    if job_id % 50 == 0:
        print(job_id)
        sleep(1)
    if job_id in jobid_visited:
        continue
    job_link = "%s%s" % (home_url, job_id)
    try:
        html = urlopen(job_link)
    except HTTPError as e:
        print(job_id, e)
        continue
    soup = BeautifulSoup(html.read(), "lxml")
    job_title = soup.find('div', attrs={'class': 'title'})
    title = job_title.h1.getText()
    company = job_title.h2.getText()
    location = job_title.h3.getText()
    submission = soup.find('p', attrs={'class': 'submission-date'})
    submission_date = submission.span['title'].split()[0]
    reviews = submission.contents[2]
    reviews = int(''.join(list(filter(str.isdigit, reviews))))  # keep digits only

    contents = submission.next_siblings  # job description tags
    word_set = set()
    for para in contents:
        if para == '\n':
            continue

        if isinstance(para, str):
            text = para.strip()
        else:
            text = para.get_text().strip()

        text = re.sub(r'[^\x00-\x7f]', r'', text)  # remove non-ASCII characters
        words = nltk.tokenize.word_tokenize(text.lower())
        words = [w for w in words if ((not w in stop_words)
                                      and (not "'" in w))]
        word_set.update(words)

    text = ' '.join(word_set)
    if len(text) > 3000:
        text = text[:3000]

    # Strip quotes so the %-formatted SQL below stays valid
    company = re.sub(r'[\'\"]', r' ', company)
    location = re.sub(r'[\'\"]', r' ', location)
    title = re.sub(r'[\'\"]', r' ', title)

    if len(company) > 150:
        company = company[:150]

    if len(title) > 150:
        title = title[:150]

    if len(location) > 150:
        location = location[:150]

    sql = '''INSERT INTO kaggle
            (id, Company, Title, Location, Contents, Created, Reviews, Link)
            VALUES ('%d', '%s', '%s', '%s', '%s', '%s', '%d', '%s')''' % \
            (job_id, company, title, location, text,
             submission_date, reviews, job_link)
    try:
        cursor.execute(sql)
        cursor.connection.commit()
    except Exception as e:
        print(job_id, e)
        break

# Fetch a single row using the fetchone() method
cursor.execute('''SELECT * FROM kaggle''')
data = cursor.fetchone()
print(data)

# Disconnect from the server
cursor.close()
db.close()
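A note on the INSERT above: the SQL string is assembled with Python %-formatting, which is why quotes have to be stripped from every field first. pymysql also supports parameterized queries, where the driver does the escaping itself; a minimal sketch of the same insert in that style (same table and variables as above):

# Parameterized form of the INSERT; pymysql escapes the values,
# so the manual quote stripping becomes optional.
sql = '''INSERT INTO kaggle
        (id, Company, Title, Location, Contents, Created, Reviews, Link)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s)'''
cursor.execute(sql, (job_id, company, title, location, text,
                     submission_date, reviews, job_link))
cursor.connection.commit()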

2. Web scraping of careerbuilder

The full script is posted in the first reply below.
Web-scraping code for indeed.com is in Jesse's blog: https://jessesw.com/Data-Science-Skills/
The keyword-analysis code is also at https://jessesw.com/Data-Science-Skills/; a small sketch of the counting step follows.
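The counting itself is simple once the ads are in MySQL. A minimal sketch of the idea, run against the kaggle table built above (the keyword list here is illustrative, not the exact one behind the charts):

# -*- coding: utf-8 -*-
# Count the share of job ads mentioning each skill keyword.
import pymysql

keywords = ['python', 'sql', 'spark', 'hadoop', 'java',
            'machine', 'statistics', 'tableau']

db = pymysql.connect(host="localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()
cursor.execute('SELECT Contents FROM kaggle')
ads = [row[0].split() for row in cursor.fetchall()]
db.close()

# Contents stores a de-duplicated, space-joined word set per ad,
# so simple membership counts each ad at most once per keyword.
counts = {kw: sum(kw in ad for ad in ads) for kw in keywords}
for kw, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print('%-12s %5.1f%%' % (kw, 100.0 * n / len(ads)))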

Main references:
1. "Web Scraping with Python" by Ryan Mitchell
2.  "Data Science in R: A Case Studies Approach to Computational Reasoning and Problem solving" Ch 12. Exploring data science jobs with web scraping and text minin
3. https://jessesw.com/Data-Science-Skills/


DL (OP) posted on 2017-6-10 23:56:50
Web scraping of careerbuilder:
# -*- coding: utf-8 -*-
import sys
from urllib.request import urlopen
from urllib.error import HTTPError
import re
import datetime
import string
from bs4 import BeautifulSoup
import pymysql

import nltk
from nltk.corpus import stopwords
from time import sleep  # to avoid overwhelming the server

date_today = datetime.date.today()

stop_words = set(stopwords.words('english'))
stop_words.update(set(string.punctuation))

# Open database connection
db = pymysql.connect(host="localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
# Prepare a cursor object using the cursor() method
cursor = db.cursor()

#cursor.execute('''DROP TABLE IF EXISTS careerBuilder''')
cursor.execute('''CREATE TABLE IF NOT EXISTS careerBuilder
    (id VARCHAR(50) NOT NULL, Company VARCHAR(150), Title VARCHAR(150),
    Location VARCHAR(150), Contents VARCHAR(3000),
    Industry VARCHAR(150), Category VARCHAR(150),
    Link VARCHAR(200), Salary VARCHAR(50), Created DATE,
    PRIMARY KEY(id))''')

# Skip jobs that are already in the table
cursor.execute('''SELECT id FROM careerBuilder''')
data = cursor.fetchall()
jobid_visited = set([d[0] for d in data])

ds_url = "https://www.careerbuilder.com/jobs-data-scientist"
sort_by = "date_desc"
page = 1
url = "%s?page_number=%d&sort=%s" % (ds_url, page, sort_by)
try:
    html = urlopen(url)
except HTTPError as e:
    print(e)
    sys.exit()

soup = BeautifulSoup(html.read(), "lxml")

# Total number of result pages, e.g. "Page 1 of 42"
num_pages = soup.find('span', attrs={'class': 'page-count'}).get_text()
num_pages = int(re.findall(r'\d+', num_pages)[1])

base_url = "https://www.careerbuilder.com"
for page in range(1, num_pages + 1):
    print('start page: ', page)
    url = "%s?page_number=%d&sort=%s" % (ds_url, page, sort_by)

    try:
        html = urlopen(url)
    except HTTPError as e:
        print(page, e)
        break

    soup = BeautifulSoup(html.read(), "lxml")
    jobs = soup.findAll('h2', attrs={'class': 'job-title'})
    for job in jobs:
        # Retrieve job id
        job_did = job.a.get('data-job-did')
        if job_did in jobid_visited:
            continue

        job_link = base_url + job.a.get('href')
        try:
            job_soup = BeautifulSoup(urlopen(job_link).read(), "lxml")
        except Exception as e:
            print(page, job_did, e)
            continue

        job_detail = job_soup.find('div', attrs={'class': 'card with-padding'})
        title = job_detail.h1.get_text().strip()
        company = ''  # some listings omit the company name
        company_location = job_detail.h2.get_text().strip().split('\n')
        if len(company_location) == 1:
            location = company_location[0]
        elif len(company_location) == 3:
            company, location = company_location[0], company_location[2]

        # Retrieve the posting date from text like "Posted 3 days ago"
        begin_date = job_detail.h3.get_text().strip()
        begin_date = re.findall(r'(\d+) day', begin_date)
        if len(begin_date) == 0:
            time_delta = 0
        else:
            time_delta = int(begin_date[0])

        begin_date = date_today - datetime.timedelta(days=time_delta)
        begin_date = begin_date.strftime("%y/%m/%d")

        # Retrieve job industry and category
        snapshot = job_detail.find('div', attrs={'class': 'job-facts item'})

        job_industry = snapshot.find('div', id='job-industry')
        if job_industry:
            job_industry = job_industry.get_text().strip()

        job_category = snapshot.find('div', id='job-categories')
        if job_category:
            job_category = job_category.get_text().strip()

        # Find the annual salary if available
        salary = ''
        for line in snapshot.get_text().splitlines():
            salary_entry = re.findall(r'^(\$.*)/Year$', line)
            if len(salary_entry) == 1:
                salary = salary_entry[0].strip()

        job_id = job_detail.find('div', class_='small-12 columns job-id')
        if job_id:
            job_id = job_id.get_text().strip().split('\n')[1]

        # Get job description and requirements
        job_item = job_detail.find('div', class_='small-12 columns item')
        text = job_item.get_text()
        # Break into lines
        lines = (line.strip() for line in text.splitlines())
        # Break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines
                  for phrase in line.split(" "))

        text = ' '.join(chunk for chunk in chunks if chunk)
        text = re.sub(r"[\'\-\/&]", r' ', text)
        words = nltk.tokenize.word_tokenize(text.lower())
        words = [w for w in words if (not w in stop_words)]
        word_set = set(words)

        text = ' '.join(word_set)

        if len(text) > 3000:
            text = text[:3000]

        # Strip quotes so the %-formatted SQL below stays valid
        company = re.sub(r'[\'\"]', r' ', company)
        location = re.sub(r'[\'\"]', r' ', location)
        title = re.sub(r'[\'\"]', r' ', title)

        if len(company) > 150:
            company = company[:150]

        if len(title) > 150:
            title = title[:150]

        if len(location) > 150:
            location = location[:150]

#        print(job_did, title, company, location, salary, begin_date,
#              job_industry, job_category, job_link, text)
#        print("------------------------------------------")

        sql = '''INSERT INTO careerBuilder
                (id, Company, Title, Location, Contents, Industry,
                Category, Link, Salary, Created)
                VALUES ('%s', '%s', '%s', '%s', '%s', '%s', '%s',
                '%s', '%s', STR_TO_DATE('%s', '%%y/%%m/%%d'))''' % \
                (job_did, company, title, location, text,
                 job_industry, job_category,
                 job_link, salary, begin_date)
        try:
            cursor.execute(sql)
            cursor.connection.commit()
        except Exception as e:
            print(job_id, e)
            break

    sleep(1)

# Fetch a single row using the fetchone() method
cursor.execute('''SELECT * FROM careerBuilder''')
data = cursor.fetchone()
print(data)

# Disconnect from the server
cursor.close()
db.close()

DL (OP) posted on 2017-6-11 00:03:35
Last edited by DL on 2017-6-11 00:35.
Ranking of key skills and education requirements in kaggle job ads:

  • The jobs listed on kaggle are closer to data scientist/engineer roles.
  • Compared with the older data in the book (reference 2 above), the shares of Python, machine learning, Spark, deep learning, and cloud are rising.
  • PhDs are more sought after (a tally sketch follows the charts below).

kaggle_skills.png
kaggle_education.png
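The education tallies behind the second chart can be approximated by checking each ad's stored word set for degree tokens. A minimal sketch, run against the kaggle table above (the degree token lists are illustrative guesses at how the tokenizer renders them):

# Tally how many ads mention each degree level at least once.
import pymysql

db = pymysql.connect(host="localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()
cursor.execute('SELECT Contents FROM kaggle')
ads = [set(row[0].split()) for row in cursor.fetchall()]
db.close()

degrees = {'PhD': {'phd', 'ph.d'},
           'Master': {'master', 'masters', 'ms'},
           'Bachelor': {'bachelor', 'bachelors', 'bs'}}
for name, tokens in degrees.items():
    n = sum(bool(tokens & ad) for ad in ads)
    print('%-8s %5.1f%%' % (name, 100.0 * n / len(ads)))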

DL (OP) posted on 2017-6-11 00:13:40
Ranking of key skills and education requirements in careerbuilder job ads:


careerbuilder_skills.png
careerbuilder_education.png

DL (OP) posted on 2017-6-14 21:22:20
Jobs found on careerbuilder by the keyword "data scientist" (which include some data analyst positions), ranked by industry; a grouping-query sketch follows the chart.

careerbuilder_industry.png
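Since the scraper stores an Industry column, the ranking can be done directly in MySQL with a GROUP BY; a minimal sketch (same table and credentials as the scraper above):

# Count postings per industry, most common first.
import pymysql

db = pymysql.connect(host="localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
cursor = db.cursor()
cursor.execute('''SELECT Industry, COUNT(*) AS n FROM careerBuilder
                  GROUP BY Industry ORDER BY n DESC''')
for industry, n in cursor.fetchall():
    print(industry, n)
db.close()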

jessicalmj posted on 2017-6-15 02:02:19
This is pretty cool; it looks like each site has a different focus. Did you analyze any other sites, like Glassdoor or indeed?

DL (OP) posted on 2017-6-23 10:43:49
jessicalmj posted on 2017-6-15 02:02:
This is pretty cool; it looks like each site has a different focus. Did you analyze any other sites, like Glassdoor or indeed?

An analysis of indeed jobs is available at https://jessesw.com/Data-Science-Skills/; I did not run one myself. The results should be fairly close to CareerBuilder's.