Overview
This guide covers two Python scripts designed to scrape job listings from LinkedIn and Dice based on specific search criteria. Both scripts use Selenium for web scraping and store the data in a MongoDB database. The LinkedIn scraper targets Oracle-related remote and location-based jobs, while the Dice scraper focuses on Oracle functional roles, distinguishing between remote and non-remote positions.
Prerequisites
- Python 3.6+ installed on your system
- MongoDB Atlas account and a database set up (or a local MongoDB instance)
- Chrome Browser installed (for LinkedIn scraper)
- Required Python libraries: `selenium`, `pymongo`, and `webdriver_manager` (the last for the Dice scraper only)
- Chromedriver: downloaded manually for the LinkedIn scraper, or managed automatically by `webdriver_manager` for the Dice scraper
- LinkedIn account (for the LinkedIn scraper, as login may be required)
1. LinkedIn Job Scraper
Script Contents
Below is the complete LinkedIn scraper script (`linkedin.py`):
Setup
- Install the required libraries:
  `pip install selenium pymongo`
- Download a Chromedriver compatible with your Chrome version and update `DRIVER_PATH`:
  `DRIVER_PATH = r"C:\path\to\chromedriver.exe"`
- Update the Chrome user data directory (optional, for auto-login):
  `chrome_options.add_argument(r"--user-data-dir=C:\Users\YourUsername\AppData\Local\Google\Chrome\User Data")`
- Ensure your MongoDB URI is correct:
  `mongo_uri = "mongodb+srv://yourusername:yourpassword@cluster0.bpujz.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"`
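Before running the scraper, it can help to sanity-check the connection string. The helper below is a hypothetical stdlib-only sketch (not part of either script) that verifies the URI has a MongoDB scheme, credentials, and a host:

```python
from urllib.parse import urlsplit

def check_mongo_uri(uri: str) -> bool:
    """Rough sanity check: MongoDB scheme, credentials, and host are present."""
    parts = urlsplit(uri)
    return (
        parts.scheme in ("mongodb", "mongodb+srv")
        and bool(parts.username)
        and bool(parts.password)
        and bool(parts.hostname)
    )

uri = "mongodb+srv://yourusername:yourpassword@cluster0.bpujz.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"
print(check_mongo_uri(uri))  # True once username, password, and host are filled in
```

This does not test that the credentials actually work; it only catches obvious copy-paste mistakes before Selenium starts up.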
Usage
- Save the script as `linkedin.py`.
- Run the script:
  `python linkedin.py`
- If a login screen appears, log in manually and press Enter in the console.
- The script will scrape remote and location-based Oracle jobs and store them in MongoDB.
Customization
- Modify `remote_search_url` and `location_search_url` for different search criteria, e.g.:
  `keywords=("New Keywords" OR "Other Terms")`
- Adjust `scroll_step` or the `time.sleep()` calls for scrolling speed.
- Add fields to the `jobs` dictionary in `scrape_jobs()`.
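Whatever keywords you choose must be percent-encoded before they go into the URL. A small sketch (the `build_keyword_query` helper is illustrative, not part of the script):

```python
from urllib.parse import quote

def build_keyword_query(expression: str) -> str:
    # LinkedIn accepts a boolean expression in the `keywords` parameter;
    # quotes, spaces, and parentheses must be percent-encoded.
    return "keywords=" + quote(expression)

q = build_keyword_query('("Oracle EBS" OR "Oracle Fusion")')
print(q)
```

Paste the resulting `keywords=...` fragment into the search URL in place of the existing one.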
Note: LinkedIn may require manual login. Ensure URLs are valid job search links.
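The incremental scrolling that `scroll_step` controls can be sketched in isolation. This is a simplified model, not the script's exact loop; `scroll_positions` is a hypothetical helper:

```python
def scroll_positions(page_height: int, scroll_step: int):
    """Yield the successive offsets an incremental scroll would visit."""
    pos = 0
    while pos < page_height:
        pos = min(pos + scroll_step, page_height)
        yield pos

# In the real scraper each offset would be sent to the browser, roughly:
#   for pos in scroll_positions(height, scroll_step):
#       driver.execute_script(f"window.scrollTo(0, {pos});")
#       time.sleep(0.5)  # let lazy-loaded job cards render
print(list(scroll_positions(2200, 1000)))  # [1000, 2000, 2200]
```

A smaller `scroll_step` (or a longer sleep) gives lazy-loaded listings more time to appear, at the cost of a slower run.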
2. Dice Job Scraper
Script Contents
Below is the complete Dice scraper script (`dice.py`):
Setup
- Install the required libraries:
  `pip install selenium pymongo webdriver_manager`
- No manual Chromedriver download is needed; `webdriver_manager` handles it.
- Verify your MongoDB URI:
  `mongo_uri = "mongodb+srv://yourusername:yourpassword@cluster0.bpujz.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0"`
Usage
- Save the script as `dice.py`.
- Run the script:
  `python dice.py`
- The script will scrape remote and non-remote Oracle functional jobs and store them in MongoDB.
Customization
- Update `remote_url` and `non_remote_url` for different search criteria, e.g.:
  `q=((NEW AND TERMS) AND (NOT (EXCLUDED)))`
- Adjust `time.sleep(5)` to change the page-load wait time.
- Modify `job_data` in `scrape_dice_jobs()` to capture additional fields.
- Remove `--headless` to see the browser during scraping.
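Rather than hand-editing the query string, the URL can be assembled with the stdlib. The parameter names below (`q`, `pageSize`, `filters.workplaceTypes`) are assumptions based on typical Dice search URLs; copy the exact names from a real search URL in your browser:

```python
from urllib.parse import urlencode

def build_dice_url(query: str, remote: bool, page_size: int = 20) -> str:
    # Assemble a Dice search URL; urlencode handles the percent-encoding
    # of the boolean query expression.
    params = {"q": query, "pageSize": page_size}
    if remote:
        params["filters.workplaceTypes"] = "Remote"
    return "https://www.dice.com/jobs?" + urlencode(params)

url = build_dice_url("(Oracle AND Functional) AND (NOT Developer)",
                     remote=True, page_size=10)
print(url)
```

The same helper with `remote=False` gives a starting point for replacing the placeholder `non_remote_url`.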
Note: The `non_remote_url` is a placeholder. Replace it with a valid Dice search URL for non-remote jobs.
Output
Both scrapers store job data in the `jobs` collection within the `jobs_db` database in MongoDB. Each entry includes fields like `_id`, `title`, `company`, `location`, `url`, and more.
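A stored document looks roughly like the dictionary below. Deriving `_id` from a hash of the job URL is one way to get a stable key for duplicate detection; the actual scripts may compute `_id` differently, so treat this as an illustrative sketch:

```python
import hashlib

def make_job_doc(title, company, location, url):
    # A stable `_id` (here, SHA-1 of the job URL) lets MongoDB reject or
    # update duplicates instead of accumulating repeated listings.
    return {
        "_id": hashlib.sha1(url.encode()).hexdigest(),
        "title": title,
        "company": company,
        "location": location,
        "url": url,
    }

doc = make_job_doc("Oracle EBS Analyst", "Acme Corp", "Remote",
                   "https://www.dice.com/job-detail/example-id")
print(doc["title"], doc["_id"][:8])
```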
Tips
- Test with a smaller `pageSize` (e.g., 10) in Dice URLs initially.
- Monitor MongoDB for duplicates (the LinkedIn scraper skips duplicates; the Dice scraper updates them).
- Respect website terms of service to avoid bans.
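The skip-versus-update distinction can be illustrated with a plain dict standing in for the MongoDB collection. With pymongo, the skip strategy corresponds to checking for an existing `_id` before `insert_one`, while the update strategy corresponds to `update_one(..., upsert=True)`; the helpers below are a simplified in-memory sketch:

```python
def store_skip(collection: dict, doc: dict) -> bool:
    """LinkedIn-style: keep the first version seen, skip duplicates."""
    if doc["_id"] in collection:
        return False
    collection[doc["_id"]] = doc
    return True

def store_upsert(collection: dict, doc: dict) -> None:
    """Dice-style: insert or overwrite, so the record stays current."""
    collection[doc["_id"]] = doc

jobs = {}
store_skip(jobs, {"_id": "a1", "title": "Oracle DBA"})
store_skip(jobs, {"_id": "a1", "title": "Oracle DBA (old)"})       # skipped
store_upsert(jobs, {"_id": "a1", "title": "Oracle DBA (updated)"})  # overwritten
print(jobs["a1"]["title"])  # Oracle DBA (updated)
```

Skipping preserves the first-seen listing; upserting keeps the newest details (such as a changed title or location) at the cost of losing the original record.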