Web Scraping With A Browser Console

Last updated: 18 February 2024

Web scraping through the browser console with JavaScript can be a great alternative to using libraries like Cheerio (https://cheerio.js.org/) or Puppeteer (https://pptr.dev/). Those libraries are suited to large or complicated web scraping jobs, and because of that they can take longer to set up when you only want to scrape something basic or simple.

How can you scrape through the browser console?

Most browsers today (Chrome, Firefox, etc.) provide a console, mainly for testing and debugging sites, that can also execute JavaScript code directly in the current browser window. To scrape data from the browser console, we can write small JavaScript scripts that read the values we want from the HTML through the Document Object Model (DOM) and output the data formatted as JSON so we can save it. This method of web scraping is mainly useful when we are only scraping data from one page, as we can only access the HTML of the current page. It also means that once we have passed any login forms or security challenges ourselves in the browser, the script does not have to deal with them and can extract the information we need.
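As a rough illustration of that idea, the sketch below reads the text of every element matching a selector and returns it as JSON. The function name scrapeText and the "h2" selector are assumptions made up for this example, not something taken from the page scraped later in the tutorial.

```javascript
// Hypothetical helper: collect the trimmed text of every element under
// `root` that matches `selector`, and return the list as a JSON string.
function scrapeText(root, selector) {
    var values = [];
    root.querySelectorAll(selector).forEach(function (element) {
        values.push(element.textContent.trim());
    });
    return JSON.stringify(values);
}

// In a browser console (the "h2" selector is only an example):
// console.log(scrapeText(document, "h2"));
```

Passing the root in as a parameter keeps the helper usable on the whole document or on a smaller container element.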

Web Scraping With A Browser Console Tutorial

In this tutorial, we will demonstrate how to use JavaScript in a browser console to extract postcode data from The Handbook’s blog (https://www.thehandbook.com/londons-best-private-members-clubs-2021/). We will extract every postcode mentioned in the blog post without manually copying anything ourselves.

Setup

We will start by opening the page in a popular browser like Chrome or Firefox and finding the data we want to scrape.

After finding the text we want to extract from the page, we right-click one of the postcodes and choose Inspect, which opens the browser's developer tools alongside the HTML code for the page. Looking at the code, we can see that the postcodes all sit in separate section tags with the class editorial-section. To grab the postcode of each section, we start by grabbing the container that holds all the sections. After finding that container, we copy its CSS path and use it to get all the sections using query selectors:

// Grab the container that holds every section (CSS path copied from the developer tools)
var container = document.querySelector("html body.thb_editorial_post-template-default.single.postid-100339853.single-thb_editorial_post.mm-wrapper.s-sticky-site-nav.scrolling div#MM_Page.mm-page.mm-slideout div.editorial-container article.editorial-paper div.editorial-content.editorial-typography");
// Select every div inside the container; this returns a NodeList
var sections = container.querySelectorAll("div");
console.log(sections);

When we output the variable holding the sections, the console should print a NodeList whose length shows how many matching elements were found in the HTML.

Once the sections are stored, we need a path from each section to the element that holds the address. We have to write this CSS path by hand: section div div div:nth-child(2) div div p. Now that we can reach the data inside one section, we want to try it on all the stored sections, which is where forEach() comes in, as it loops through each section.

sections.forEach(sectionData => {
    // querySelector returns null when nothing inside this element matches the path
    var insideContainer = sectionData.querySelector("section div div div:nth-child(2) div div p");
    console.log(insideContainer);
});

However, one issue we may run into is that not every section contains the postcode data we need. Some contain other text, so the CSS path we wrote earlier matches nothing there and the console logs null for those sections.

We can fix that by adding a null check before logging each section, and instead of logging the element itself we can read the raw HTML inside it through its innerHTML property.
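The null check and innerHTML step described above could look roughly like the sketch below. The helper name collectAddressHtml is hypothetical, and the function assumes it receives the sections list from the earlier step (or any list of objects that offer querySelector).

```javascript
// Hypothetical helper: return the inner HTML of the element matched by
// `selector` in each section, skipping sections where nothing matches.
function collectAddressHtml(sections, selector) {
    var results = [];
    sections.forEach(function (sectionData) {
        var insideContainer = sectionData.querySelector(selector);
        // querySelector returns null when the path does not exist in this section
        if (insideContainer !== null) {
            results.push(insideContainer.innerHTML);
        }
    });
    return results;
}

// In the console, with `sections` from the earlier step:
// console.log(collectAddressHtml(sections, "section div div div:nth-child(2) div div p"));
```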

The console now shows each address as the raw HTML found at the CSS path we specified. To grab only the postcode, we could export the data and format it outside of the browser, but if we know the Regex (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Cheatsheet) for the postcode format, we can extract just the postcode directly.
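To show the regular expression step on its own, here is a small sketch that applies the postcode pattern used later in this tutorial to a plain string. The helper name extractPostcode and the sample address are made up for illustration. The pattern is a loose approximation of the UK postcode format, so it may also match other text that happens to have the same shape.

```javascript
// Pattern used in the full script; the i flag makes it case-insensitive.
var postcodePattern = /[A-Z]{1,2}[0-9]{1,2}[A-Z]?\s?[0-9][A-Z]{2}/i;

// Hypothetical helper: return the first postcode-like match in `html`,
// or null when there is none.
function extractPostcode(html) {
    var match = html.match(postcodePattern);
    return match === null ? null : match[0];
}

// Illustrative sample input, not taken from the scraped page:
console.log(extractPostcode("<strong>Address:</strong> 1 Sample Road, London SW1A 1AA"));
// prints "SW1A 1AA"
```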

With only the data we need extracted, we can focus on formatting it for output as JSON. We start with a JS array to hold our postcodes, fill it inside the forEach() loop, and convert it to JSON once the loop has finished.

var postCodeArray = [];

var container = document.querySelector("html body.thb_editorial_post-template-default.single.postid-100339853.single-thb_editorial_post.mm-wrapper.s-sticky-site-nav.scrolling div#MM_Page.mm-page.mm-slideout div.editorial-container article.editorial-paper div.editorial-content.editorial-typography");
var sections = container.querySelectorAll("div");

sections.forEach(sectionData => {
    var insideContainer = sectionData.querySelector("section div div div:nth-child(2) div div p");

    // Skip sections where the CSS path matches nothing
    if (insideContainer !== null) {
        // match() returns null when no postcode is found in the HTML
        var postCode = insideContainer.innerHTML.match(/[A-Z]{1,2}[0-9]{1,2}[A-Z]?\s?[0-9][A-Z]{2}/i);
        if (postCode !== null) {
            postCodeArray.push(postCode[0]);
        }
    }
});

// Convert the collected postcodes to a JSON string and print it
var json = JSON.stringify(postCodeArray);
console.log(json);

We can now save the output JSON data and use it anywhere we need it, or even convert it to CSV.
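One possible way to do the CSV conversion in the console itself is sketched below. The helper name toCsv and the "postcode" column header are assumptions for this example. The input is expected to be an array of strings such as postCodeArray.

```javascript
// Hypothetical helper: turn a list of values into a one-column CSV string.
function toCsv(values, header) {
    var lines = [header];
    values.forEach(function (value) {
        // Quote each field and double any embedded quotes
        lines.push('"' + String(value).replace(/"/g, '""') + '"');
    });
    return lines.join("\n");
}

// In the console, after running the script above:
// console.log(toCsv(postCodeArray, "postcode"));
```

Chrome and Firefox also provide a copy() helper in the console, so copy(json) or copy(toCsv(postCodeArray, "postcode")) should place the result on the clipboard, ready to paste into a file.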