Python Script 7: Scraping tweets using BeautifulSoup

Twitter is one of the most popular social networking services, used by many of the world's most prominent people. Tweets can be used to perform sentiment analysis.

In this article we will see how to scrape tweets using BeautifulSoup. We are not using the Twitter API because most APIs have rate limits.



Setup:

Create a virtual environment. If you are not in the habit of working with virtual environments, please stop immediately and read this article on virtual environments first.

Once the virtual environment is created and activated, install the dependencies in it.

pip install beautifulsoup4==4.6.0 bs4==0.0.1 requests==2.18.4 lxml

(lxml is included because the script passes 'lxml' as the parser to BeautifulSoup; without it, BeautifulSoup raises a FeatureNotFound error.)


Analysing Twitter Web Requests:

Let's say we want to scrape all the tweets made by the Honourable Prime Minister of India, Shri Narendra Modi.

Open the browser (I am using Chrome) and press F12 to open the developer tools.

Now go to the URL https://twitter.com/narendramodi. In the Network tab of the developer tools, you will see the response of the request made to /narendramodi.

The response is an HTML page. We will convert this HTML response into a BeautifulSoup object and extract the tweets from it.

[Screenshot: the Network tab showing the HTML response for the /narendramodi request]

If you scroll down the page to load more tweets, you will see more requests being sent, where the response is not plain HTML but JSON.


[Screenshot: the Network tab showing the JSON response returned while scrolling]

Extracting tweets from HTML content:

First inspect a tweet element on the web page. You will see that each tweet is enclosed in an li HTML tag. The actual tweet text is inside a p tag, which is a descendant of the li tag.

We will first get all the li tags and then the p tag inside each li tag. The text contained in the p tag is what we need.
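
To see the idea in isolation before the full script, here is a minimal, self-contained sketch. The HTML snippet is made up for illustration and only loosely mimics Twitter's timeline markup; the tag and attribute choices mirror the ones used in the script below.

from bs4 import BeautifulSoup

# Illustrative markup only -- a heavily simplified stand-in for Twitter's timeline HTML.
sample_html = """
<ol>
  <li data-item-type="tweet">
    <p class="TweetTextSize js-tweet-text tweet-text">First sample tweet</p>
  </li>
  <li data-item-type="tweet">
    <p class="TweetTextSize js-tweet-text tweet-text">Second sample tweet</p>
  </li>
</ol>
"""

soup = BeautifulSoup(sample_html, "html.parser")
for li in soup.find_all("li", {"data-item-type": "tweet"}):
    p = li.find("p")          # the tweet text lives in a p tag inside the li
    print(p.text.strip())

Running this prints the text of each sample tweet, which is exactly what the script does on the real page, only with Twitter's longer class names.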


Code to start with:

# script to scrape tweets by a twitter user.
# Author - ThePythonDjango.Com
# dependencies - BeautifulSoup, requests

from bs4 import BeautifulSoup
import requests
import sys
import json


def usage():
    msg = """
    Please use the below command to use the script.
    python script_name.py twitter_username
    """
    print(msg)
    sys.exit(1)


def get_username():
    # if username is not passed
    if len(sys.argv) < 2:
        usage()
    username = sys.argv[1].strip().lower()
    if not username:
        usage()

    return username


def start(username=None):
    username = get_username()
    url = "https://twitter.com/" + username
    print("\n\nDownloading tweets for " + username)
    response = None
    try:
        response = requests.get(url)
    except Exception as e:
        print(repr(e))
        sys.exit(1)
    
    if response.status_code != 200:
        print("Non success status code returned "+str(response.status_code))
        sys.exit(1)

    soup = BeautifulSoup(response.text, 'lxml')

    if soup.find("div", {"class": "errorpage-topbar"}):
        print("\n\n Error: Invalid username.")
        sys.exit(1)

    tweets = get_tweets_data(username, soup)


We will start with the start function. First we collect the username from the command line and then send a request to the user's Twitter page.

If there is no exception and the status code returned in the response is 200, i.e. success, we proceed; otherwise we exit.

We convert the response text into a BeautifulSoup object and check whether there is any div tag in the HTML with the class errorpage-topbar. If there is, the username is invalid. Strictly speaking this check is redundant, because an invalid username returns a 404 status, which is already caught by the status_code check above.


Extract tweet text:

def get_this_page_tweets(soup):
    tweets_list = list()
    tweets = soup.find_all("li", {"data-item-type": "tweet"})
    for tweet in tweets:
        tweet_data = None
        try:
            tweet_data = get_tweet_text(tweet)
        except Exception:
            # ignore this tweet if there is any loading or parsing error
            continue

        if tweet_data:
            tweets_list.append(tweet_data)
            print(".", end="")
            sys.stdout.flush()

    return tweets_list


def get_tweets_data(username, soup):
    tweets_list = list()
    tweets_list.extend(get_this_page_tweets(soup))


As discussed, we first find all the li tags and then, for each one, try to get the tweet text out of it.

We print a dot on the screen every time a tweet is scraped successfully to show progress; otherwise the user may think the script is doing nothing or has hung.

def get_tweet_text(tweet):
    # the p tag holding the tweet text
    tweet_text_box = tweet.find("p", {"class": "TweetTextSize TweetTextSize--normal js-tweet-text tweet-text"})
    # links to images/media inside the tweet; their text will be stripped out below
    images_in_tweet_tag = tweet_text_box.find_all("a", {"class": "twitter-timeline-link u-hidden"})
    tweet_text = tweet_text_box.text
    for image_in_tweet_tag in images_in_tweet_tag:
        tweet_text = tweet_text.replace(image_in_tweet_tag.text, '')

    return tweet_text


Tweets sometimes contain images; we discard those for now. We do this by finding the media link tags inside the tweet and replacing their text with an empty string.


Scraping more tweets:

So far we were able to get tweets from the first page only. As we scroll down to load more pages, we get JSON responses, which have to be parsed slightly differently.

def get_tweets_data(username, soup):
    tweets_list = list()
    tweets_list.extend(get_this_page_tweets(soup))

    next_pointer = soup.find("div", {"class": "stream-container"})["data-min-position"]

    while True:
        next_url = "https://twitter.com/i/profiles/show/" + username + \
                   "/timeline/tweets?include_available_features=1&" \
                   "include_entities=1&max_position=" + next_pointer + "&reset_error_state=false"

        next_response = None
        try:
            next_response = requests.get(next_url)
        except Exception as e:
            # in case there is some issue with request. None encountered so far.
            print(e)
            return tweets_list

        tweets_data = next_response.text
        tweets_obj = json.loads(tweets_data)
        if not tweets_obj["has_more_items"] and not tweets_obj["min_position"]:
            # using two checks here because in one case has_more_items was false but there were more items
            print("\nNo more tweets returned")
            break
        next_pointer = tweets_obj["min_position"]
        html = tweets_obj["items_html"]
        soup = BeautifulSoup(html, 'lxml')
        tweets_list.extend(get_this_page_tweets(soup))

    return tweets_list


First we check whether there are more tweets. If yes, we find the next pointer and build the next URL. Once the JSON is received, we take the items_html part and repeat the process of creating a soup and extracting tweets. We keep doing this until there are no more tweets to scrape, which we know from the has_more_items and min_position fields in the JSON response.
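
For reference, based on the keys the code above reads, the JSON returned by the timeline endpoint can be pictured roughly like this (the values here are placeholders, not real data):

# Rough shape of the timeline JSON the script relies on (placeholder values).
tweets_obj = {
    "min_position": "opaque-cursor-string",        # pointer used to build the next URL
    "has_more_items": True,                        # becomes False once the timeline is exhausted
    "items_html": "<li data-item-type=\"tweet\"> ... </li>"  # HTML fragment holding the next tweets
}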



Complete script:

Now all the functions are complete. Let's put them together.
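
If you prefer to assemble the script yourself, the only glue missing from the snippets above is writing the scraped tweets to disk and adding an entry point. A minimal sketch, assuming all the functions live in one file; the script on GitHub may organise this differently:

# A sketch of the final glue, assuming the functions above are all defined in this file.
# The last lines of start() hand the scraped tweets to dump_data() (shown in the
# "Dumping data in a file" section below) and report how many tweets were written:
#
#     tweets = get_tweets_data(username, soup)
#     dump_data(username, tweets)
#     print(str(len(tweets)) + " tweets dumped.")

if __name__ == "__main__":
    start()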

Download the complete script from GitHub.



Running the script:

Assuming you have installed the dependencies in the virtual environment, let's run the script.

(scrappingvenv) rana@Nitro:python_scripts$ python tweets_scrapper.py narendramodi


Downloading tweets for narendramodi
............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
No more tweets returned

Dumping data in file narendramodi_twitter.json
844 tweets dumped.
(scrappingvenv) rana@Nitro:python_scripts$ 
 

You might want to introduce some wait between requests if you run into rate limit errors.
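
One simple way to do that is a small wrapper around requests.get() that sleeps before each call. polite_get below is a hypothetical helper name and the two-second delay is an arbitrary choice, not a documented Twitter limit; you would call it in place of requests.get(next_url) inside get_tweets_data().

import time
import requests

def polite_get(url, delay_seconds=2):
    # Wait a bit before each request so they are spaced out.
    # delay_seconds=2 is an arbitrary illustration; tune it if you still hit rate limits.
    time.sleep(delay_seconds)
    return requests.get(url)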



Dumping data in a file:

You might want to dump the data into a file. I prefer dumping it in JSON format.

# dump final result in a json file
def dump_data(username, tweets):
    filename = username+"_twitter.json"
    print("\nDumping data in file " + filename)
    data = dict()
    data["tweets"] = tweets
    with open(filename, 'w') as fh:
        fh.write(json.dumps(data))

    return filename
 

Let us know if you face any issues.

