Capturing Hacker News Mentions with Node.js and MongoDB
The problem is that Hacker News gives you a firehose of information without a particularly easy way to filter it for the topics you're interested in. Let's say, for example, you want to know about Bitcoin as soon as it's mentioned. How would you do that on the Hacker News website?
In this tutorial, we're going to learn how to parse through Hacker News data as it is created, filtering for only the topics that we're interested in. We'll run a sentiment analysis on the potential matches to rank them, and then we'll store this information in MongoDB so we can run reports from it. We're going to do it all with Node.js and some simple pipelines.
You won't need a Hacker News account for this tutorial, but you will need a few things to be successful:
We'll be storing all of our matches in MongoDB Atlas. This will make it easier for us to run reports and not depend on looking at logs or similarly structured data.
Before we get into the interesting code and our overall journey toward understanding and storing the Hacker News data as it comes in, we need to bootstrap our project.
On your computer, create a new project directory and execute the following commands:
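The original commands aren't reproduced in this copy. A plausible bootstrap, assuming npm, might look like the following; aside from the MongoDB driver and ndjson (both used later in this tutorial), the package names are assumptions about which libraries fill each role:

```shell
# Create the project directory and initialize an npm project.
mkdir hn-mentions && cd hn-mentions
npm init -y

# Install the dependencies. The MongoDB driver and ndjson appear later in
# this tutorial; the HTTP client and sentiment package names below are
# assumptions -- substitute your preferred libraries for those roles.
npm install mongodb ndjson
npm install needle        # streaming HTTP client (assumed)
npm install sentiment     # sentiment analysis (assumed)
```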
We'll be using an HTTP request package to consume raw data from the API. As we progress, you'll notice that we're working with streams of data rather than one-off requests to the API. This means that the data we receive might not always be complete. To make sense of it, we use the ndjson package to get usable JSON from the stream. Since we're working with streams, we also can't just pass our JSON data through the pipeline as is; instead, we use transform stages to filter and manipulate the JSON before passing it to the next stage in the pipeline. Finally, we have a sentiment analysis package for scoring our data.
We'll revisit each of these packages as we progress.
Before moving to the next step, make sure you create a main.js file in your project. This is where we'll add our code, which you'll see isn't too many lines.
We're going to start by adding our downloaded dependencies to our code file and connecting to a MongoDB cluster or instance.
Open the project's main.js file and add the following code:
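The tutorial's starter code isn't reproduced in this copy. A minimal sketch of what it might look like, assuming the official MongoDB Node.js driver and placeholder database/collection names:

```javascript
// Sketch of main.js: pull in dependencies and connect to MongoDB.
// The database name ("hacker-news") and collection name ("mentions")
// are placeholders, not values from the original tutorial.
const { pipeline } = require("stream");
const { MongoClient } = require("mongodb");
const ndjson = require("ndjson"); // used when we build the pipeline
// Other stream and sentiment packages from the install step get
// required here as well, once the pipeline stages need them.

const client = new MongoClient(process.env.ATLAS_URI);

(async () => {
  try {
    await client.connect();
    const collection = client.db("hacker-news").collection("mentions");
    console.log("Connected to MongoDB!");
    // Streaming pipeline stages will be added here in the next sections.
  } catch (error) {
    console.error(error);
  }
})();
```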
In the above code, we've added all of our downloaded dependencies, plus some. Remember we're working with a stream of data, so we need to use pipelines in Node.js if we want to work with that data in stages.
When we run the application, we are connecting to a MongoDB instance or cluster as defined in our environment variables. The ATLAS_URI variable would look something like this:
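The value follows the standard Atlas SRV connection-string format, with placeholders for your own credentials and host:

```
ATLAS_URI=mongodb+srv://<username>:<password>@<cluster-host>/<database>?retryWrites=true&w=majority
```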
You can find the connection string in your MongoDB Atlas dashboard.
Test that the application can connect to the database by executing the following command:
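Assuming main.js is the entry point and ATLAS_URI is exported in your shell, that command is simply:

```shell
node main.js
```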
If you don't want to use environment variables, you can hard-code the value in your project or use a configuration file. I personally prefer environment variables because we can set them externally on most cloud deployments for security (and there's no risk that we accidentally commit them to GitHub).
At this point, the code we have will connect us to MongoDB. Now we need to focus on streaming the Hacker News data into our application and filtering it for the data that we actually care about.
Let's make the following changes to our main.js file:
In the above code, after we connect, we create a pipeline of stages to complete. The first stage is a simple GET request to the streaming API endpoint. The results from our request should be JSON, but since we're working with a stream of data rather than expecting a single response, our result may be malformed depending on where we are in the stream. This is normal.
By the time the ndjson.parse stage completes, we should have properly formed JSON to work with. This means we need to analyze it to see if it's JSON data we want to keep or toss. Remember, the streaming API gives us all data coming from Hacker News, not just what we're looking for. To filter it, we use a filter stage: we return true if the body of the Hacker News mention includes "bitcoin" or the title of the thread includes the "bitcoin" term, meaning this particular entry is what we're looking for, and it will be passed to the next stage in the pipeline. Anything that doesn't match will be ignored by future stages.
At this point, we should have matches on Hacker News data that we're interested in. However, Hacker News has a ton of bots and users posting potentially irrelevant data just to rank in people's searches. It's a good idea to analyze our match and score it to know the quality. Then later, we can choose to ignore matches with a low score as they will probably be a waste of time.
So let's adjust our pipeline a bit in the main.js file:
We first initialize the package through the following line:
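The initialization line isn't shown in this copy; assuming the npm sentiment package, it would look like:

```javascript
const Sentiment = require("sentiment"); // npm "sentiment" package (assumed)
const sentiment = new Sentiment();
```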
When we manipulate the matched object, we are performing a sentiment analysis on the body of the mention. At this point, we don't care what the score is, but we plan to add it to the data which we'll eventually store in MongoDB.
The object as of now might look something like this:
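The original object isn't reproduced here, and the exact fields depend on the streaming API, so treat this shape as illustrative; the score field is the one we just added:

```json
{
  "id": 123456,
  "title": "Ask HN: Is Bitcoin still relevant?",
  "body": "Curious what people here think about it.",
  "score": 1
}
```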
The only modification we've made to the data as of right now is the addition of a score from our sentiment analysis.
It's important to note that our data is not yet inside of MongoDB. We're just at the stage where we've made modifications to the stream of data that could be a match to our interests.
With the data formatted how we want it, we can focus on storing it within MongoDB and querying it whenever we want.
Let's make a modification to our pipeline:
We're doing another transformation on our object. This could have been merged with the earlier transformation stage, but for code cleanliness, we are breaking them into two stages.
In this final stage, we are doing an insertOne operation with the MongoDB Node.js driver. We're taking the row of data from the previous stage and adding two new fields to the object before it is inserted. We're doing this so we have quick access to the URL and don't have to rebuild it later.
If we ran the application, it would run forever, collecting any data posted to Hacker News that matched our filter.
If we wanted to query our data within MongoDB, we could use an MQL query like the following:
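The original query isn't shown in this copy. As an illustrative example in mongosh, assuming a collection named mentions, you might pull only the higher-quality matches, sorted by score:

```javascript
db.mentions.find({ score: { $gt: 0 } }).sort({ score: -1 })
```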
You just saw an example of using MongoDB and Node.js to capture relevant data from Hacker News as it happens. This could be useful for keeping your own feed of particular topics, or it could be extended to other use cases, such as monitoring what people are saying about your brand and using the code as a feedback-reporting tool.