Python Script 2: Crawling all emails from a website

This is the second article in the Python scripts series.

In this article, we will see how to crawl all the pages of a website and fetch all the email addresses found on them.

Important: Please note that some sites may not want you to crawl them. Please honour their robots.txt file; ignoring it may, in some cases, lead to legal action. This article is for educational purposes only. Readers are requested not to misuse it.
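
Before crawling, it is worth checking what a site's robots.txt actually allows. Below is a minimal sketch (not part of the original script) using the standard library's urllib.robotparser; the site url and page url are placeholders.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://www.example.com/robots.txt')
rp.read()

# True if the default user agent ('*') is allowed to fetch this page
if rp.can_fetch('*', 'http://www.example.com/contact'):
    print('Allowed by robots.txt')
else:
    print('Disallowed by robots.txt')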


Instead of explaining the code separately, I have embedded comments above the relevant source lines and tried to explain the code wherever it seemed necessary.

Please comment in case of any query. You will need to install a few packages, namely requests, beautifulsoup4 and lxml, for this script to work (for example: pip install requests beautifulsoup4 lxml). It is recommended that you create a virtual environment and install the packages in it.

import re
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
from bs4 import BeautifulSoup

# starting url. replace it with the url of the site you want to crawl.
starting_url = 'http://www.miet.ac.in'

# a queue of urls to be crawled
unprocessed_urls = deque([starting_url])

# a set of urls that have already been crawled
processed_urls = set()

# a set of fetched emails
emails = set()

# process urls one by one from unprocessed_url queue until queue is empty
while unprocessed_urls:

    # move next url from the queue to the set of processed urls
    url = unprocessed_urls.popleft()
    processed_urls.add(url)

    # extract base url and current page path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    # directory portion of the current url, used to resolve page-relative links;
    # if the url has no path, append a slash so links are joined correctly
    path = url[:url.rfind('/')+1] if '/' in parts.path else url + '/'

    # get url's content
    print("Crawling URL %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors and continue with next url
        continue

    # extract all email addresses and add them into the resulting set
    # You may edit the regular expression as per your requirement
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
    print(emails)
    # create a BeautifulSoup object for the html document
    soup = BeautifulSoup(response.text, 'lxml')

    # Once this document is parsed and processed, now find and process all the anchors i.e. linked urls in this document
    for anchor in soup.find_all("a"):
        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links (starting with /)
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add the new url to the queue if it was not in unprocessed list nor in processed list yet
        if link not in unprocessed_urls and link not in processed_urls:
            unprocessed_urls.append(link)
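
As written, the loop above keeps running until the queue is empty, which on a heavily linked site can take a very long time (see the comments below). A minimal sketch of one way to bound the crawl: replace the while line with a page limit and add a polite delay between requests. The max_pages and crawl_delay names are hypothetical; pick values that suit you.

import time

max_pages = 100   # stop after this many pages have been crawled
crawl_delay = 1   # seconds to wait between requests, to be polite to the server

while unprocessed_urls and len(processed_urls) < max_pages:
    url = unprocessed_urls.popleft()
    processed_urls.add(url)
    time.sleep(crawl_delay)
    # ... the rest of the loop body stays exactly as above ...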
 

Constructive feedback is always welcome.



6 thoughts on 'Python Script 2: Crawling All Emails From A Website'
Mrcee :
Hi there, how long does this script take to complete crawling? The crawl does not seem to be contained to the website. If you could advise, it would be much appreciated.
Admin :
Yes, this script is not contained to one site only; it will crawl any other website that is linked from the current one. It will keep running until it has no more links to process. If there is any specific issue, let us know.
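
For reference, one way to keep the crawl contained to a single domain is to skip links whose host differs from the starting url's. A minimal fragment, to be added inside the anchor loop of the script above, just before the link is queued:

        # skip links that point outside the starting domain
        if urlsplit(link).netloc != urlsplit(starting_url).netloc:
            continue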

Faheem :
Can you modify the script so that it exports the scraped emails in CSV format? A lot of thanks in advance; I appreciate your guidance. Regards, Faheem (Skype: mfaheem2009)
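
For reference, a minimal sketch of exporting the collected addresses to CSV with the standard library's csv module, assuming the emails set from the script above (the filename is a placeholder):

import csv

with open('emails.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['email'])       # header row
    for email in sorted(emails):     # one email address per row
        writer.writerow([email])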

Mitchell :
What version of Python is required to run this code? I have version 3.7 and it doesn't seem to work. I don't know what I am doing wrong.
Admin :
I used Python 3.4. What is the error you are getting? Please share a pastebin link.

Ceskoslovensko :
For me, it was working on Python 3.7; I just updated the exceptions:

    try:
        response = requests.get(url.strip())
    except (requests.exceptions.InvalidSchema, requests.exceptions.InvalidURL):
        continue
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        continue
    except (UnicodeEncodeError, UnicodeError):
        continue

Robert :
Partially related to your program, but is it possible to automate this?
Admin :
Yes, it is totally possible to automate this. Just add a cron job to run it automatically.
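
For example, a crontab entry along these lines would run the script every day at midnight (the interpreter and script paths are placeholders):

0 0 * * * /usr/bin/python3 /path/to/email_crawler.py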

Rebecca Young :
Can you narrow down your search, or is it random? If we all use the script this way, won't we be getting the same or recurring emails? Kindly reply, admin. Thanks.
Admin :
It is domain-specific. You can start with a different website to get different emails.
