BlogAnnounced at MongoDB.local NYC 2024: A recap of all announcements and updatesLearn more >>
MongoDB Developer
Atlas
plus
Sign in to follow topics
MongoDB Developer Centerchevron-right
Developer Topicschevron-right
Productschevron-right
Atlaschevron-right

Nairobi Stock Exchange Web Scraper

Kennedy Mwaura19 min read • Published May 09, 2023 • Updated Apr 02, 2024
AtlasPython
Facebook Icontwitter iconlinkedin icon
Rate this tutorial
star-empty
star-empty
star-empty
star-empty
star-empty
Looking to build a web scraper using Python and MongoDB for the Nairobi Stock Exchange? Our comprehensive tutorial provides a step-by-step guide on how to set up the development environment, create a Scrapy spider, parse the website, and store the data in MongoDB.
We also cover best practices for working with MongoDB and tips for troubleshooting common issues. Plus, get a sneak peek at using MongoDB Atlas Charts for data visualization. Finally, enable text notifications using Africas Talking API (feel free to switch to your preferred provider). Get all the code on GitHub and streamline your workflow today!

Prerequisites

The prerequisites below are verified to work on Linux. Implementation on other operating systems may differ. Kindly check installation instructions.

Table of contents

  • What is web scraping?
  • Project layout
  • Project setup
  • Starting a Scrapy project
  • Creating a spider
  • Running the scraper
  • Enabling text alerts
  • Data in MongoDB Atlas
  • Charts in MongoDB Atlas
  • CI/CD with GitHub Actions
  • Conclusion

What is web scraping?

Web scraping is the process of extracting data from websites. It’s a form of data mining, which automates the retrieval of data from the web. Web scraping is a technique to automatically access and extract large amounts of information from a website or platform, which can save a huge amount of time and effort. You can save this data locally on your computer or to a database in the cloud.

What is Scrapy?

Scrapy is a free and open-source web-crawling framework written in Python. It extracts the data you need from websites in a fast and simple yet extensible way. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

What is MongoDB Atlas?

MongoDB Atlas is a fully managed cloud database platform that hosts your data on AWS, Google Cloud, or Azure. It’s a fully managed database as a service (DBaaS) that provides a highly available, globally distributed, and scalable database infrastructure. Read our tutorial to get started with a free instance of MongoDB Atlas.
You can also head to our docs to learn about limiting access to your cluster to specified IP addresses. This step enhances security by following best practices.

Project layout

Below is a diagram that provides a high-level overview of the project.
A diagram view of the project
The diagram above shows how the project runs as well as the overall structure. Let's break it down:
  • The Scrapy project (spiders) crawl the data from the afx (data portal for stock data) website.
  • Since Scrapy is a full framework, we use it to extract and clean the data.
  • The data is sent to MongoDB Atlas for storage.
  • From here, we can easily connect it to MongoDB Charts for visualizations.
  • We package our web scraper using Docker for easy deployment to the cloud.
  • The code is hosted on GitHub and we create a CI/CD pipeline using GitHub Actions.
  • Finally, we have a text notification script that runs once the set conditions are met.

Project setup

Let's set up our project. First, we'll create a new directory for our project. Open your terminal and navigate to the directory where you want to create the project. Then, run the following command to create a new directory and change into it .
Next, we'll create a virtual environment for our project. This will help us isolate our project dependencies from the rest of our system. Run the following command to create a virtual environment. We are using the inbuilt Ppython module venv to create the virtual environment. Activate the virtual environment by running the activate script in the bin directory.
Now, we'll install the required dependencies. We'll use pip to install the dependencies. Run the following command to install the required dependencies:

Starting a Scrapy project

Scrapy is a full framework. Thus, it has an opinionated view on the structure of its projects. It comes with a CLI tool to get started quickly. Now, we'll start a new Scrapy project. Run the following command.
This will create a new directory with the name nse_scraper and a few files. The nse_scraper directory is the actual Python package for our project. The files are as follows:
  • items.py — This file contains the definition of the items that we will be scraping.
  • middlewares.py — This file contains the definition of the middlewares that we will be using.
  • pipelines.py — This contains the definition of the pipelines that we will be using.
  • settings.py — This contains the definition of the settings that we will be using.
  • spiders — This directory contains the spiders that we will be using.
  • scrapy.cfg — This file contains the configuration of the project.

Creating a spider

A spider is a class that defines how a certain site will be scraped. It must subclass scrapy.Spider and define the initial requests to make — and optionally, how to follow links in the pages and parse the downloaded page content to extract data.
We'll create a spider to scrape the afx website. Run the following command to create a spider. Change into the nse_scraper folder that is inside our root folder.
This will create a new file afx_scraper.py in the spiders directory. Open the file and replace the contents with the following code:
Let's break down the code above. First, we import the required modules and classes. In our case, we'll be using _CrawlSpider _and Rule from scrapy.spiders and LinkExtractor from scrapy.linkextractors. We'll also be using BeautifulSoup from bs4 to clean the scraped data.
The AfxScraperSpider class inherits from CrawlSpider, which is a subclass of Spider. The Spider class is the core of Scrapy. It defines how a certain site (or a group of sites) will be scraped. It contains an initial list of URLs to download, and rules to follow links in the pages and extract data from them. In this case, we'll be using CrawlSpider to crawl the website and follow links to the next page.
The name attribute defines the name of the spider. This name must be unique within a project — that is, you can’t set the same name for different spiders. It will be used to identify the spider when you run it from the command line.
The allowed_domains attribute is a list of domains that this spider is allowed to crawl. If it isn’t specified, no domain restrictions will be in place. This is useful if you want to restrict the crawling to a particular domain (or subdomain) while scraping multiple domains in the same project. You can also use it to avoid crawling the same domain multiple times when using multiple spiders.
The start_urls attribute is a list of URLs where the spider will begin to crawl from. When no start_urls are defined, the start URLs are read from the sitemap.xml file (if it exists) of the first domain in the allowed_domains list. If you don’t want to start from a sitemap, you can define an initial URL in this attribute. This attribute is optional and can be omitted.
The user_agent attribute is used to set the user agent for the spider. This is useful when you want to scrape a website that blocks spiders that don't have a user agent. In this case, we'll be using a user agent for Chrome. We can also set the user agent in the settings.py file. This is key to giving the target website the illusion that we are a real browser.
The custom_settings attribute is used to set custom settings for the spider. In this case, we'll be setting the DEPTH_LIMIT to 1 and CLOSESPIDER_PAGECOUNT to 1. The DEPTH_LIMIT attribute limits the maximum depth that will be allowed to crawl for any site. Depth refers to the number of page(s) the spider is allowed to crawl. The CLOSESPIDER_PAGECOUNT attribute is used to close the spider after crawling the specified number of pages.
The rules attribute defines the rules for the spider. We'll be using the Rule class to define the rules for extracting links from a page and processing them with a callback, or following them and scraping them using another spider.
The Rule class takes a LinkExtractor object as its first argument. The LinkExtractor class is used to extract links from web pages. It can extract links matching specific regular expressions or using specific attributes, such as href or src.
The deny argument is used to deniesy the extraction of links that match the specified regular expression. The callback argument specifiesis used to specify the callback function to be called on the response of the extracted links.
The follow argument specifies whether the links extracted should be followed or not. We'll be using the callback argument to specify the callback function to be called on the response of the extracted links. We'll also be using the follow argument to specify whether the links extracted should be followed or not.
We then define a parse_item function that takes response as an argument. The parse_item function is used to parses the response and extracts the required data. We'll use the xpath method to extract the required data. The xpath method extracts data using XPath expressions.
We get xpath expressions by inspecting the target website. Basically, we right-click on the element we want to extract data from and click on inspect. This will open the developer tools. We then click on the copy button and select copy xpath. Paste the xpath expression in the xpath method.
The re method extracts data using regular expressions. We then use the clean_stock_symbol, clean_stock_name, and clean_stock_price functions to clean the extracted data. Use the zip function to combine the extracted data into a single list. Then, use a for loop to iterate through the list and yield the data to Scrapy.
The clean_stock_symbol, clean_stock_name, and clean_stock_price functions are used to clean the extracted data. The clean_stock_symbol function takes the raw symbol as an argument. BeautifulSoup class cleans the raw symbol. It then uses the split method to split the cleaned symbol into a list. An if statement checks if the length of the list is greater than 1. If it is, it returns the second item in the list. If it isn't, it returns None.
The clean_stock_name function takes the raw name as an argument. It uses the BeautifulSoup class to clean the raw name. It then uses the split method to split the cleaned name into a list. Again, an if statement will check if the length of the list is greater than 1. If it is, it returns the first item in the list. If it isn't, it returns None. The clean_stock_price function takes the raw price as an argument. It then uses the BeautifulSoup class to clean the raw price and return the cleaned price.
The clean_stock_change function takes the raw change as an argument. It uses the BeautifulSoup class to clean the raw change and return the cleaned data.

Updating the items.py file

Inside the root of our project, we have the items.py file. An item is a container which will be loaded with the scraped data. It works similarly to a dictionary with additional features like declaring its fields and customizing its export. We'll be using the Item class to create our items. The Item class is the base class for all items. It provides the general mechanisms for handling data from scraped pages. It’s an abstract class and cannot be instantiated directly. We'll be using the Field class to create our fields.
Add the following code to the nse_scraper/items.py file:
The NseScraperItem class is creates our item. The ticker_symbol, stock_name, stock_price, and stock_change fields store the ticker symbol, stock name, stock price, and stock change respectively. Read more on items here.

Updating the pipelines.py file

Inside the root of our project, we have the pipelines.py file. A pipeline is a component which processes the items scraped from the spiders. It can clean, validate, and store the scraped data in a database. We'll use the Pipeline class to create our pipelines. The Pipeline class is the base class for all pipelines. It provides the general methods and properties that the pipeline will use.
Add the following code to the pipelines.py file:
First, we import the pymongo module. We then import the DropItem class from the scrapy.exceptions module. Next, import the NseScraperItem class from the items module.
The NseScraperPipeline class creates our pipeline. The collection variable store the name of the collection we'll be using. The init method initializes the pipeline. It takes the mongodb_uri and mongo_db as arguments. It then uses an if statement to check if the mongodb_uri is set. If it isn't, it raises a ValueError. Next, it uses an if statement to check if the mongo_db is set. If it isn't, it raises a ValueError.
The from_crawler method creates an instance of the pipeline. It takes the crawler as an argument. It then returns an instance of the pipeline. The open_spider method opens the spider. It takes the spider as an argument. It then creates a MongoClient instance and stores it in the client variable. It uses the client instance to connect to the database and stores it in the db variable.
The close_spider method closes the spider. It takes the spider as an argument. It then closes the client instance. The clean_stock_data method cleans the scraped data. It takes the item as an argument. It then uses an if statement to check if the ticker_symbol is None. If it is, it raises a DropItem. Next, it uses an if statement to check if the stock_name is None. If it is, it raises a DropItem. It then uses an if statement to check if the stock_price is None. If it is, it raises a DropItem. If none of the if statements are true, it returns the item.
The process_item method processes the scraped data. It takes the item and spider as arguments. It then uses the clean_stock_data method to clean the scraped data. It uses the dict function to convert the item to a dictionary. Next, it prints the data to the console. It then uses the db instance to insert the data into the database. It returns the item.

Updating the

settings.py

file

Inside the root of our project, we have the settings.py file. This file is used to stores our project settings. Add the following code to the settings.py file:
First, we import the os and load_dotenv modules. We then call the load_dotenv function. It takes no arguments. This function loads the environment variables from the .env file.
nse_scraper.spiders. We append the MONGODB_URI variable and set it to the MONGODB_URI environment variable. Next, we create the MONGODB_DATABASE variable and set it to the MONGO_DATABASE environment variable.
After, we create the ITEM_PIPELINES variable and set it to nse_scraper.pipelines.NseScraperPipeline. We then create the LOG_LEVEL variable and set it to INFO. The DEFAULT_REQUEST_HEADERS variable is set to a dictionary. Next, we create the HTTPCACHE_ENABLED variable and set it to True.
Change the HTTPCACHE_EXPIRATION_SECS variable and set it to 360. Create the HTTPCACHE_DIR variable and set it to httpcache. Finally, create the HTTPCACHE_STORAGE variable and set it to scrapy.extensions.httpcache.FilesystemCacheStorage.

Project structure

The project structure is as follows:

Running the scraper

To run the scraper, we'll need to open a terminal and navigate to the project directory. We'll then need to activate the virtual environment if it's not already activated. We can do this by running the following command:
Create a .env file in the root of the project (in /nse_scraper/). Add the following code to the .env file:
Add your MongoDB URI, database name, Africas Talking username, API key, and mobile number to the .env file for your MongoDB URI. You can use the free tier of MongoDB Atlas. Get your URI over on the Atlas dashboard, under the connect button. It should look something like this:
We need to run the following command to run the scraper while in the project folder:
(/nse_scraper /):

Enabling text alerts (using Africas Talking)

Install the africastalking module by running the following command in the terminal:
Create a new file called stock_notification.py in the nse_scraper directory. Add the following code to the
stock_notification.py
file:
The code above imports the africastalking module. Import the os and load_dotenv modules. We proceed to call the load_dotenv function. It takes no arguments. This function loads the environment variables from the .env file.
  • We create the at_username variable and set it to the at_username environment variable. We then create the at_api_key variable and set it to the at_api_key environment variable. Create the mobile_number variable and set it to the mobile_number environment variable. And create the mongo_uri variable and set it to the MONGODB_URI environment variable.
  • We initialize the africastalking module by passing the at_username and at_api_key variables as arguments. Create the sms variable and set it to at.SMS. Create the account variable and set it to at.Application.
  • Create the ticker_data variable and set it to an empty list. Create the stock_notification function. It takes two arguments: message and number. We then try to send the message to the number and print the response. Look for any exceptions and display them.
  • We created the stock_query function. We then create the client variable and set it to a pymongo.MongoClient object. Create the db variable and set it to the nse_data database. Then, create the collection variable and set it to the stock_data collection, and create the ticker_data variable and set it to the collection.find_one method. It takes a dictionary as an argument.
The stock_name variable is set to the name key in the ticker_data dictionary. Create the stock_price variable and set it to the price key in the ticker_data dictionary. Create the sms_data variable and set it to a dictionary. It contains the stock_name and stock_price variables.
The message variable is set to a string containing the stock name and price. We check if the stock price is greater than or equal to 38. If it is, we call the stock_notification function and pass the message and mobile_number variables as arguments. If it isn't, we print a message to the console.
Close the connection to the database and return the sms_data variable. Call the stock_query function.
We need to add the following code to the afx_scraper.py file:
If everything is set up correctly, you should something like this:

Data in MongoDB Atlas

We need to create a new cluster in MongoDB Atlas. We can do this by:
  • Clicking on the Build a Cluster button.
  • Selecting the Shared Clusters option.
  • Selecting the Free Tier option.
  • Selecting the Cloud Provider & Region option.
  • Selecting the AWS option. (I selected the AWS Cape Town option.)
  • Selecting the Cluster Name option.
  • Giving the cluster a name. (We can call it nse_data.)
Let’s configure a user to access the cluster by following the steps below:
  • Select the Database Access option.
  • Click on the Add New User option.
  • Give the user a username. (I used nse_user.).
  • Give the user a password. (I used nse_password).
  • Select the Network Access option.
  • Select the Add IP Address option.
  • Select the Allow Access from Anywhere option.
  • Select the Cluster option. We'll then need to select theCreate Cluster option.
Click on the Collections option and then on the + Create Database button. Give the database a name. We can call it nse_data. Click on the + Create Collection button. Give the collection a name. We can call it stock_data. If everything is set up correctly, you should see something like this:
Database records displayed in MongoDB Atlas
If you see an empty collection, rerun the project in the terminal to populate the values in MongoDB. Incase of an error, read through the terminal output. Common issues could be:
  • The IP aAddress was not added in the dashboard.
  • A lLack of/iIncorrect credentials in your .env file.
  • A sSyntax error in your code.
  • A poorCheck your internet connection.
  • A lLack of appropriate permissions for your user.

Metrics in MongoDB Atlas

Let's go through how to view metrics related to our database(s).
  • Click on the **Metrics option.
  • Click on the + Add Metric button.
  • Select the Database option.
  • Select the nse_data option.
  • Select the Collection option.
  • Select the stock_data option.
  • Select the Metric option.
  • Select the Documents option.
  • Select the Time Range option.
  • Select the Last 24 Hoursoption.
  • Select the Granularity option.
  • Select the 1 Hour option.
  • Click on the Add Metric button.
If everything is set up correctly, it will look like this:
Cluster metrics shown in MongoDB Atlas

Charts in MongoDB Atlas

MongoDB Atlas offers charts that can be used to visualize the data in the database. Click on the Charts option. Then, click on the + Add Chart button. Select the Database option. Below is a screenshot of sample charts for NSE data:
Data displayed in MongoDB Atlas Charts

Version control with Git and GitHub

Ensure you have Git installed on your machine, along with a GitHub account.
Run the following command in your terminal to initialize a git repository:
Create a .gitignore file. We can do this by running the following command in our terminal:
Let’s add the .env file to the .gitignore file. Add the following code to the .gitignore file:
Add the files to the staging area by running the following command in our terminal:
Commit the files to the repository by running the following command in our terminal:
Create a new repository on GitHub by clicking on the + icon on the top right of the page and selecting New repository. Give the repository a name. We can call it nse-stock-scraper. Select Public as the repository visibility. Select Add a README file and Add .gitignore and select Python from the dropdown. Click on the Create repository button.
Add the remote repository to our local repository by running the following command in your terminal:
Push the files to the remote repositor by running the following command in your terminal:

CI/CD with GitHub Actions

Create a new folder — .github — and a workflows folder inside, in the root directory of the project. We can do this by running the following command in our terminal. Inside the workflows file, we'll need to create a new file called scraper-test.yml. We can do this by running the following command in our terminal:
Inside the
scraper-test.yml
file, we'll need to add the following code:
Let's break down the code above. We create a new workflow called Scraper test with MongoDB. We then set the on event to push. Create a new job called build. Set the runs-on to ubuntu-latest. Set the strategy to a matrix. It contains the python-version and mongodb-version variables. Set the python-version to 3.8, 3.9, and 3.10. Set the mongodb-version to 4.4, 5.0, and 6.0.
Create a new step called Checkout. Set the uses to actions/checkout@v2. Create a new step called Set up Python ${{ matrix.python-version }} and set the uses to actions/setup-python@v1. Set the python-version to ${{ matrix.python-version }}. Create a new step called Set up MongoDB ${{ matrix.mongodb-version }}. This sets up different Python versions and MongoDB versions for testing.
The Install dependencies step installs the dependencies. Create a new step called Lint with flake8. This step lints the code. Create a new step called scraper-test. This step runs the scraper and tests it.
Commit the changes to the repository by running the following command in your terminal:
Go to the Actions tab on your repository. You should see something like this:
Displaying the build process

Conclusion

In this tutorial, we built a stock price scraper using Python and Scrapy. We then used MongoDB to store the scraped data. We used Africas Talking to send SMS notifications. Finally, we implemented a CI/CD pipeline using GitHub Actions.
There are definite improvements that can be made to this project. For example, we can add more stock exchanges. We can also add more notification channels. This project should serve as a good starting point.
Thank you for reading through this far., I hope you have gained insight or inspiration for your next project with MongoDB Atlas. Feel free to comment below or reach out for further improvements. We’d love to hear from you! This project is oOpen sSource and available on GitHub —, clone or fork it!, I’m excited to see what you build.

Facebook Icontwitter iconlinkedin icon
Rate this tutorial
star-empty
star-empty
star-empty
star-empty
star-empty
Related
Tutorial

ELT MongoDB Data Using Airbyte


Nov 16, 2022 | 2 min read
Tutorial

Working with MongoDB Charts and the New JavaScript SDK


Apr 02, 2024 | 10 min read
Tutorial

Migrating PostgreSQL to MongoDB Using Confluent Kafka


Apr 10, 2024 | 10 min read
Tutorial

Building an AI Agent With Memory Using MongoDB, Fireworks AI, and LangChain


Apr 23, 2024 | 21 min read
Table of Contents