[DataScience] Web scraping and exploratory analysis of data scientist job ads

DL posted on 2017-06-10 23:36:13
Posting some code for scraping job ads, together with the analysis results. I hope it's useful and serves as a starting point for further discussion.
1. Web scraping of kaggle jobs
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from urllib.error import HTTPError
import re
import string
from bs4 import BeautifulSoup
import pymysql
import nltk
from nltk.corpus import stopwords
from time import sleep  # To prevent overwhelming the server

stop_words = set(stopwords.words('english'))
stop_words.update(set(string.punctuation))

# Open database connection
db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
# prepare a cursor object using cursor() method
cursor = db.cursor()

# execute SQL query using execute() method
#cursor.execute('''DROP TABLE IF EXISTS kaggle''')
cursor.execute('''CREATE TABLE IF NOT EXISTS kaggle
    (id INT NOT NULL, Company VARCHAR(150), Title VARCHAR(150),
    Location VARCHAR(150), Contents VARCHAR(3000),
    Created VARCHAR(20), Reviews INT, Link VARCHAR(50),
    PRIMARY KEY(id))''')

# skip job ids already stored in the table
cursor.execute('''SELECT id FROM kaggle''')
data = cursor.fetchall()
jobid_visited = set([d[0] for d in data])

jobId_start = 16800
jobId_end = 17900

home_url = "https://www.kaggle.com/jobs/"

for job_id in range(jobId_start, jobId_end):
    if job_id % 50 == 0:
        print(job_id)
        sleep(1)
    if job_id in jobid_visited:
        continue
    job_link = "%s%s" % (home_url, job_id)
    try:
        html = urlopen(job_link)
    except HTTPError as e:
        print(job_id, e)
        continue
    soup = BeautifulSoup(html.read(), "lxml")
    job_title = soup.find('div', attrs={'class': 'title'})
    title = job_title.h1.getText()
    company = job_title.h2.getText()
    location = job_title.h3.getText()
    submission = soup.find('p', attrs={'class': 'submission-date'})
    submission_date = submission.span['title'].split()[0]
    reviews = submission.contents[2]
    reviews = int(''.join(list(filter(str.isdigit, reviews))))

    contents = submission.next_siblings  # job description tags
    word_set = set()
    for para in contents:
        if para == '\n':
            continue

        if isinstance(para, str):
            text = para.strip()
        else:
            text = para.get_text().strip()

        text = re.sub(r'[^\x00-\x7f]', r'', text)  # remove non-ASCII characters
        words = nltk.tokenize.word_tokenize(text.lower())
        words = [w for w in words if ((not w in stop_words)
                    and (not "'" in w))]
        word_set.update(words)

    text = ' '.join(word_set)
    if len(text) > 3000:
        text = text[:3000]

    # strip quotes so they do not break the SQL string below
    company = re.sub(r'[\'\"]', r' ', company)
    location = re.sub(r'[\'\"]', r' ', location)
    title = re.sub(r'[\'\"]', r' ', title)

    # truncate fields to the column widths defined above
    if len(company) > 150:
        company = company[:150]

    if len(title) > 150:
        title = title[:150]

    if len(location) > 150:
        location = location[:150]

    sql = '''INSERT INTO kaggle
            (id, Company, Title, Location, Contents, Created, Reviews, Link)
            VALUES ('%d', '%s', '%s', '%s', '%s', '%s', '%d', '%s')''' %\
            (job_id, company, title, location, text,
             submission_date, reviews, job_link)
    try:
        cursor.execute(sql)
        cursor.connection.commit()
    except Exception as e:
        print(job_id, e)
        break

# Fetch a single row using fetchone() method
cursor.execute('''SELECT * FROM kaggle''')
data = cursor.fetchone()
print(data)

# disconnect from server
cursor.close()
db.close()

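One note on the INSERT step above: building the SQL string with `%` formatting means quotes inside a job ad can still break the statement (which is why the code strips them by hand first). pymysql also accepts parameterized queries via `cursor.execute(sql, args)` with `%s` placeholders, which lets the driver do the escaping. A minimal sketch of the same idea, shown with the stdlib `sqlite3` module (placeholder `?` instead of `%s`) so it runs without a MySQL server; the row values are made up:

```python
import sqlite3

# With pymysql the equivalent call would be:
#   cursor.execute("INSERT INTO kaggle (id, Company, Title, Location) "
#                  "VALUES (%s, %s, %s, %s)", row)
# The driver escapes each value, so stripping quotes by hand is unnecessary.
db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute('''CREATE TABLE kaggle
    (id INTEGER PRIMARY KEY, Company TEXT, Title TEXT, Location TEXT)''')

# hypothetical row: note the quotes in the company name survive intact
row = (16801, 'O\'Reilly "Media"', 'Data Scientist', 'Boston, MA')
cursor.execute('INSERT INTO kaggle (id, Company, Title, Location) '
               'VALUES (?, ?, ?, ?)', row)
db.commit()

cursor.execute('SELECT Company FROM kaggle WHERE id = ?', (16801,))
print(cursor.fetchone()[0])  # O'Reilly "Media"
db.close()
```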
2. Web scraping of careerbuilder
(full code posted in the first reply below)
For the web-scraping code for indeed.com, see Jesse's blog: https://jessesw.com/Data-Science-Skills/
The keyword-analysis code is also at https://jessesw.com/Data-Science-Skills/
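The gist of that keyword analysis is to count, for each skill term, the fraction of stored job ads that mention it. A minimal sketch of the counting step (the skill list and sample ads are made up for illustration; in practice the texts would come from the `Contents` column populated above):

```python
from collections import Counter

def skill_frequencies(ads, skills):
    """Return the fraction of job ads mentioning each skill term."""
    counts = Counter()
    for ad in ads:
        words = set(ad.lower().split())   # each ad counted at most once per skill
        for skill in skills:
            if skill.lower() in words:
                counts[skill] += 1
    return {s: counts[s] / len(ads) for s in skills}

# hypothetical ads; real ones come from: SELECT Contents FROM kaggle
ads = [
    "python sql machine learning spark",
    "r sql statistics",
    "python deep learning",
]
skills = ["Python", "SQL", "Spark", "R"]
print(skill_frequencies(ads, skills))  # Python and SQL appear in 2 of 3 ads
```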

Main references:
1. "Web Scraping with Python" by Ryan Mitchell
2. "Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving", Ch. 12, Exploring Data Science Jobs with Web Scraping and Text Mining
3. https://jessesw.com/Data-Science-Skills/


DL (OP) posted on 2017-06-10 23:56:50
Web scraping of careerbuilder:
# -*- coding: utf-8 -*-
import sys
from urllib.request import urlopen
from urllib.error import HTTPError
import re
import datetime
import string
from bs4 import BeautifulSoup
import pymysql
import nltk
from nltk.corpus import stopwords
from time import sleep  # To prevent overwhelming the server

date_today = datetime.date.today()

stop_words = set(stopwords.words('english'))
stop_words.update(set(string.punctuation))

# Open database connection
db = pymysql.connect("localhost", user="testuser", passwd="******",
                     db="jobs", charset='utf8')
# prepare a cursor object using cursor() method
cursor = db.cursor()

# execute SQL query using execute() method
#cursor.execute('''DROP TABLE IF EXISTS careerBuilder''')
cursor.execute('''CREATE TABLE IF NOT EXISTS careerBuilder
    (id VARCHAR(50) NOT NULL, Company VARCHAR(150), Title VARCHAR(150),
    Location VARCHAR(150), Contents VARCHAR(3000),
    Industry VARCHAR(150), Category VARCHAR(150),
    Link VARCHAR(200), Salary VARCHAR(50), Created DATE,
    PRIMARY KEY(id))''')

# skip job ids already stored in the table
cursor.execute('''SELECT id FROM careerBuilder''')
data = cursor.fetchall()
jobid_visited = set([d[0] for d in data])

ds_url = "https://www.careerbuilder.com/jobs-data-scientist"
sort_by = "date_desc"
page = 1
url = "%s?page_number=%d&sort=%s" % (ds_url, page, sort_by)
try:
    html = urlopen(url)
except HTTPError as e:
    print(e)
    sys.exit()

soup = BeautifulSoup(html.read(), "lxml")

# total number of result pages, e.g. "Page 1 of 42"
num_pages = soup.find('span', attrs={'class': 'page-count'}).get_text()
num_pages = int(re.findall(r'\d+', num_pages)[1])

base_url = "https://www.careerbuilder.com"
for page in range(1, num_pages + 1):
    print('start page: ', page)
    url = "%s?page_number=%d&sort=%s" % (ds_url, page, sort_by)
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(page, e)
        break

    soup = BeautifulSoup(html.read(), "lxml")
    jobs = soup.findAll('h2', attrs={'class': 'job-title'})
    for job in jobs:
        # retrieve job id
        job_did = job.a.get('data-job-did')
        if job_did in jobid_visited:
            continue

        job_link = base_url + job.a.get('href')
        try:
            job_soup = BeautifulSoup(urlopen(job_link).read(), "lxml")
        except Exception as e:
            print(page, job_did, e)
            continue

        job_detail = job_soup.find('div', attrs={'class': 'card with-padding'})
        title = job_detail.h1.get_text().strip()

        company_location = job_detail.h2.get_text().strip().split('\n')
        if len(company_location) == 1:
            company = ''  # company not listed for this posting
            location = company_location[0]
        elif len(company_location) == 3:
            company, location = company_location[0], company_location[2]

        # retrieve job posted date, e.g. "Posted 3 days ago"
        begin_date = job_detail.h3.get_text().strip()
        begin_date = re.findall(r'(\d+) day', begin_date)
        if len(begin_date) == 0:
            time_delta = 0
        else:
            time_delta = int(begin_date[0])

        begin_date = date_today - datetime.timedelta(days=time_delta)
        begin_date = begin_date.strftime("%y/%m/%d")

        # retrieve job category
        snapshot = job_detail.find('div', attrs={'class': 'job-facts item'})

        job_industry = snapshot.find('div', id='job-industry')
        if job_industry:
            job_industry = job_industry.get_text().strip()

        job_category = snapshot.find('div', id='job-categories')
        if job_category:
            job_category = job_category.get_text().strip()

        # find annual salary if available
        salary = ''
        for line in snapshot.get_text().splitlines():
            salary_entry = re.findall(r'^(\$.*)/Year', line)
            if len(salary_entry) == 1:
                salary = salary_entry[0].strip()

        job_id = job_detail.find('div', class_='small-12 columns job-id')
        if job_id:
            job_id = job_id.get_text().strip().split('\n')[1]

        # get job description and requirements
        job_item = job_detail.find('div', class_='small-12 columns item')
        text = job_item.get_text()
        # break into lines
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines
            for phrase in line.split(" "))

        text = ' '.join(chunk for chunk in chunks if chunk)
        text = re.sub(r"[\'\-\/&]", r' ', text)
        words = nltk.tokenize.word_tokenize(text.lower())
        words = [w for w in words if (not w in stop_words)]
        word_set = set(words)

        text = ' '.join(word_set)

        if len(text) > 3000:
            text = text[:3000]

        # strip quotes so they do not break the SQL string below
        company = re.sub(r'[\'\"]', r' ', company)
        location = re.sub(r'[\'\"]', r' ', location)
        title = re.sub(r'[\'\"]', r' ', title)

        # truncate fields to the column widths defined above
        if len(company) > 150:
            company = company[:150]

        if len(title) > 150:
            title = title[:150]

        if len(location) > 150:
            location = location[:150]

#        print(job_did, title, company, location, salary, begin_date,
#              job_industry, job_category, job_link, text)
#        print("------------------------------------------")

        sql = '''INSERT INTO careerBuilder
                (id, Company, Title, Location, Contents, Industry,
                Category, Link, Salary, Created)
                VALUES ('%s', '%s', '%s', '%s', '%s', '%s', '%s',
                '%s', '%s', STR_TO_DATE('%s', '%%y/%%m/%%d'))''' %\
                (job_did, company, title, location, text,
                 job_industry, job_category,
                 job_link, salary, begin_date)

        try:
            cursor.execute(sql)
            cursor.connection.commit()
        except Exception as e:
            print(job_id, e)
            break

    sleep(1)

# Fetch a single row using fetchone() method
cursor.execute('''SELECT * FROM careerBuilder''')
data = cursor.fetchone()
print(data)

# disconnect from server
cursor.close()
db.close()

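Two of the parsing steps above are easy to sanity-check in isolation: extracting the annual salary with `re.findall(r'^(\$.*)/Year', line)`, and converting a "Posted N days ago" string into a concrete date via `datetime.timedelta`. A small self-contained check (the sample strings are made up; a fixed date stands in for `date_today` so the result is reproducible):

```python
import re
import datetime

# salary extraction: capture everything from the leading "$" up to "/Year"
line = "$100,000 - $130,000/Year"          # hypothetical snapshot line
salary = re.findall(r'^(\$.*)/Year', line)
print(salary)  # ['$100,000 - $130,000']

# posted-date conversion: "Posted 3 days ago" -> date string
posted = "Posted 3 days ago"               # hypothetical h3 text
days = re.findall(r'(\d+) day', posted)
time_delta = int(days[0]) if days else 0
begin_date = datetime.date(2017, 6, 14) - datetime.timedelta(days=time_delta)
print(begin_date.strftime("%y/%m/%d"))  # 17/06/11
```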

DL (OP) posted on 2017-06-11 00:03:35

Skill ranking and education requirements from the kaggle job ads:

  • The jobs listed on kaggle are closer to data scientist/engineer roles.
  • Compared with the older data in the book, the share of Python, machine learning, Spark, deep learning, and cloud is rising.
  • PhDs are more in demand.

kaggle_skills.png
kaggle_education.png

DL (OP) posted on 2017-06-11 00:13:40

Skill ranking and education requirements from the career builder job ads:

careerbuilder_skills.png
careerbuilder_education.png

DL (OP) posted on 2017-06-14 21:22:20

Jobs found on career builder by searching the keyword "data scientist" (the results include some data analyst positions), ranked by industry:
careerbuilder_industry.png

jessicalmj posted on 2017-06-15 02:02:19

This is pretty cool; it looks like each site has a different emphasis. Have you analyzed any other sites, such as Glassdoor or indeed?

DL (OP) posted on 2017-06-23 10:43:49, replying to jessicalmj (2017-6-15 02:02):

An analysis of indeed jobs is available at https://jessesw.com/Data-Science-Skills/; I didn't work through it myself. The results should be fairly close to CareerBuilder's.
