Unveiling the Power of Web Scraping: A Comprehensive Guide

Data is the new Gold. The ability to harness vast amounts of information from the web can be a game-changer for businesses, researchers, and developers alike. Enter web scraping—a technique that empowers us to extract, collect, and analyze data from websites automatically. If knowledge is power, then web scraping is the key to unlocking an ocean of knowledge.

Welcome to our blog, where we delve into the world of web scraping. In this comprehensive guide, we will navigate the intricate terrain of web scraping technologies, techniques, and best practices. Let's get started!

What is web scraping, and why use it?

Web scraping is like having a superpower in the digital world. It's the art of automating the process of extracting valuable information from websites, turning the vast expanse of the internet into a treasure trove of data. Imagine a digital detective, tirelessly scouring the web, extracting insights, and presenting them to you in a format of your choosing.

With web scraping, you're not limited to what's readily available - you can gather specific, tailored information from across the internet, transforming it into actionable intelligence. It's a bridge between the human mind and the immense data-scape, enabling us to unlock hidden patterns, trends, and opportunities.

In essence, web scraping is the key that opens the door to a wealth of knowledge, empowering businesses, researchers, and innovators to make informed decisions and uncover hidden gems in the vast digital landscape.

  • Price comparison: track your competitors' prices in real time and adjust your own prices on the fly. You can even show your customers what your competitors are up to so they see the advantages of buying from you instead.

  • Lead generation: generate smart leads by scraping publicly available contact information and social media platform profiles to find new customers and potential business leads.

  • Content aggregation: aggregate content to create new uses for data, make data easier to read, or add value by notifying users when prices or content change.

  • Market research: gain market insights by scraping data about your business, customer demand, feedback in the wild, or even identify opportunities in the real world by analyzing demographic changes and trends.

  • Product development: gather product data to help you develop your new product or build a game-changing tool.

  • Sentiment analysis: turn opinions into data. Supply your sentiment analysis projects with real-time web data no matter whether it’s tweets, comments, product reviews, or news articles.

  • Machine learning: web data is the fuel of AI, machine learning, and LLMs. Get the data you need for your ML projects.

Now that we know the what and why of scraping, let's talk about the how 🛠️.

From Source to Dataset: A Four-Step Guide to Web Data Extraction

  1. Target Acquisition: Determine the content you want to extract.

  2. Web Scraping Tool or Library:

    1. HTTP libraries (requests in Python, HTTPClient in Java, etc.)

    2. HTML parsers (BeautifulSoup in Python, Jsoup in Java, etc.)

    3. Browser automation tools (Selenium, Puppeteer)

  3. Data Extraction: Write the bot that navigates the pages and pulls out the target fields. Hi, I am a bot! Yeah, you guessed it right, we deal with a lot of bots here.

  4. Data Storage and Post-Processing: Store data in CSV, JSON, or a database. Analyze it for insights, ensuring ethical scraping practices and site compliance. (A minimal end-to-end sketch of these four steps follows.)
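
To make the four steps concrete, here is a minimal hedged sketch in Python using requests and BeautifulSoup against scrapethissite.com (a practice site that also appears later in this post); the CSS selectors are assumptions about that page's markup.

import csv

import requests
from bs4 import BeautifulSoup

# 1. Target acquisition: the page we want data from
url = "https://scrapethissite.com/pages/forms/?page_num=1"

# 2. Tooling: an HTTP library plus an HTML parser
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# 3. Data extraction: pull out the fields we care about
# ("tr.team", ".name", ".year" are assumptions about the page's markup)
rows = []
for team in soup.select("tr.team"):
    name = team.select_one(".name")
    year = team.select_one(".year")
    if name and year:
        rows.append({"name": name.get_text(strip=True),
                     "year": year.get_text(strip=True)})

# 4. Storage and post-processing: persist to CSV for later analysis
with open("teams.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "year"])
    writer.writeheader()
    writer.writerows(rows)
print(f"Saved {len(rows)} rows")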

Breaking Barriers: Strategies for Effective Web Scraping

  1. Website Changes: Websites frequently undergo updates and changes in their structure, CSS classes, or HTML elements. Adapting the scraper to handle these changes can be time-consuming.

  2. Anti-Scraping Measures: Websites implement various anti-scraping techniques, such as CAPTCHAs, rate limiting, IP blocking, and honeypots. Overcoming these measures requires creative solutions.

  3. Dynamic Content Loading: Some websites load content dynamically using JavaScript or AJAX. This can be challenging to scrape as it may require handling JavaScript execution or waiting for elements to load.

  4. Data Handling and Data Integrity: Managing large datasets requires efficient code to avoid performance issues. Data cleaning and validation are crucial to ensure accuracy and consistency, addressing missing or malformed data.

  5. Ethical Considerations: Adhering to robots.txt guidelines is vital. Prioritize error handling, recovery, scalability, efficiency, and maintainability for responsible web scraping practices.

Anti-Scraping Measures:

CAPTCHA - Completely Automated Public Turing test to tell Computers and Humans Apart. Why CAPTCHA?
It stands as a sentinel against unauthorized access, data scraping, and automated traffic. Its algorithm-generated tests - be they textual, visual, or audio-based - serve as the last line of defence. Mastering CAPTCHAs demands distinctly human skills:

  1. Invariant Recognition: Deciphering CAPTCHAs demands the human ability to identify patterns that remain consistent amidst variation.

  2. Segmentation: Our cognitive prowess excels in breaking down complex elements, a feat that CAPTCHAs challenge computers to replicate.

  3. Parsing Context: Understanding the broader context grants us an edge in comprehending CAPTCHAs, an area where machines are still striving to catch up.

Types of CAPTCHAs

  1. Normal CAPTCHA: a distorted image containing text that humans can read but machines struggle with.

  2. Text CAPTCHA: a question followed by a typed answer, e.g. 1 + 1 = ?

  3. Key CAPTCHA / Click CAPTCHA: these involve solving puzzles, image CAPTCHAs, rotating CAPTCHAs, GeeTest CAPTCHAs, and hCaptcha, among others.

How is reCaptcha Integrated into Websites?

  1. Get the CAPTCHA from your favourite service provider

  2. Register your website. This will provide you with a site key and a secret key.

  3. Add the reCAPTCHA script to your page:

<head> 
<!-- Add reCAPTCHA script -->
<script src="https://www.google.com/recaptcha/api.js" async defer></script> 
</head>
  4. Add the CAPTCHA widget to your form:

     <form action="/submit" method="post">
         <!-- Your form fields go here -->
         <div class="g-recaptcha" data-sitekey="YOUR_SITE_KEY"></div>
         <button type="submit">Submit</button>
     </form>
     <!--Replace YOUR_SITE_KEY with your actual reCAPTCHA site key.-->
    
  5. Verify the CAPTCHA on the server side:

     import requests
    
     def verify_captcha(recaptcha_response):
         secret_key = "YOUR_SECRET_KEY"
         payload = {'secret': secret_key, 'response': recaptcha_response}
         response = requests.post("https://www.google.com/recaptcha/api/siteverify", data=payload)
         result = response.json()
         return result['success']
     # Replace YOUR_SECRET_KEY with your actual reCAPTCHA secret key.
     # Usage:
     # if verify_captcha(request.form['g-recaptcha-response']):
     #     # CAPTCHA is valid, process the form
     # else:
     #     # CAPTCHA failed, show an error message
    

Use XPath/CSS selectors to detect reCAPTCHA by looking for an element whose class or attributes contain "recaptcha". This initial step is pivotal in engineering tailored solutions to navigate and successfully bypass CAPTCHAs.
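
As a minimal sketch of that detection step, the code below checks for the g-recaptcha widget class used in the form snippet above and for the iframe that the api.js script injects; real pages may differ.

import requests
from bs4 import BeautifulSoup

def page_has_recaptcha(html: str) -> bool:
    soup = BeautifulSoup(html, "html.parser")
    # The widget div added in the form snippet above
    if soup.select_one("div.g-recaptcha"):
        return True
    # The iframe that the api.js script injects at render time
    return any("recaptcha" in (iframe.get("src") or "").lower()
               for iframe in soup.find_all("iframe"))

response = requests.get("https://example.com", timeout=10)
if page_has_recaptcha(response.text):
    print("CAPTCHA detected: back off, rotate identity, and retry later")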

Proactive measures outshine reactive solutions every time!🚀

The recommended approach is to prevent CAPTCHA from appearing in the first place and, if blocked, to retry the request. Alternatively, you can solve it, but the success rate and performance are much lower and the cost is significantly higher.

So how do we scrape past CAPTCHAs?

1. Rotate IP proxies, switch user agents, and insert random delays for seamless scraping, since the website's anti-scraping detection service is triggered when the same IP starts hitting the servers aggressively (see the sketch after this list).

2. "Robots.txt is a website's rulebook outlining what you can and can't scrape. It specifies which URLs are off-limits. Respecting it is a must!"

3. Use headless browsers, and send realistic headers and referrers in your requests to the server.

4. When dealing with the login/auth wall content, save those cookies!

5. Watch out for honeypot traps (covered later in this post).

6. Use CAPTCHA-solving services. 2Captcha, AntiCaptcha, DeathByCaptcha, AZCaptcha, ImageTyperZ, EndCaptcha, BypassCaptcha, and CaptchaTronix are some service providers, but they are neither cost- nor performance-effective.
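
Points 1 and 3 can be combined in a few lines. A minimal hedged sketch with requests follows; the header values, delays, and retry policy are illustrative assumptions rather than a guaranteed bypass.

import random
import time

import requests

# Realistic-looking headers; values are illustrative, rotate them in practice
HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"),
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}

def polite_get(url, max_retries=3):
    for attempt in range(max_retries):
        time.sleep(random.uniform(1, 5))            # random delay between requests
        response = requests.get(url, headers=HEADERS, timeout=10)
        if response.status_code in (403, 429):      # blocked or rate-limited
            time.sleep(10 * (2 ** attempt))         # back off, then retry
            continue
        return response
    return None                                     # still blocked after retries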

We will explore the following topics in depth in the subsequent sections of this blog.

How to Avoid Being Detected as a Bot 🤖🚫

  1. Disabling the Automation Indicator - WebDriver Flags

Automating a browser with tools like Selenium often leaves behind distinctive fingerprints that signal automated control, rather than human interaction. Websites are designed to detect these patterns, which can lead to restricted access.

The window.navigator.webdriver property is a read-only property of the Navigator interface that indicates whether the user agent is being controlled by automation (like Selenium). Websites commonly check this property to detect automated browsing, and it can be suppressed through Selenium options and flags to reduce or eliminate this indicator.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class CustomizedChromeDriver {

    public static void main(String[] args) {
        // Set the path to your ChromeDriver executable
        System.setProperty("webdriver.chrome.driver", "path_to_chromedriver");
        // Create ChromeOptions instance
        ChromeOptions options = new ChromeOptions();
        // Add the argument to disable webdriver property
        options.addArguments("--disable-blink-features=AutomationControlled");
        // Initialize WebDriver with custom options
        WebDriver driver = new ChromeDriver(options);
        // Now you can use 'driver' to interact with the browser
        driver.get("https://example.com");
        // Rest of your code...
    }
}
from selenium import webdriver

options = webdriver.ChromeOptions() 
options.add_argument('--disable-blink-features=AutomationControlled')
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://www.website.com")

References:

Mozilla - Also check the Browser compatibility here.

StackOverflow

Github

  2. Remove the JavaScript Signature

Bot detectors analyze JavaScript signatures in WebDrivers (e.g., ChromeDriver, GeckoDriver) to identify automated access. The signature is stored in variables prefixed with 'cdc_', and websites use it to restrict access. We'll use Agent Ransack or Notepad++ to locate this signature in the binary file and Vim to remove it. The same method applies to other WebDrivers such as GeckoDriver and EdgeDriver.

1. To evade detection, replace the signature 'cdc_' with a string of the same length, like 'abc_'.

2. Use Vim to open ChromeDriver's binary file. After installation, run vim.exe <pathTo>\chromedriver.exe.

3. Then, type :%s/cdc_/abc_/g to replace 'cdc_' with 'abc_'. Save and exit with :wq.

4. Delete the backup files Vim generates (those ending in '~').

5. Now, search again with Agent Ransack. The 'cdc_' signature should no longer be found, so your scraper stays undetected because the signature used to identify the bot has been removed. (You can also verify this from Python, as sketched below.)
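
A quick, hedged way to confirm the patch without Agent Ransack is to scan the binary from Python; the path below reuses the placeholder from the earlier snippet.

from pathlib import Path

# Path reuses the placeholder from the earlier snippet; point it at your binary
data = Path(r"C:\WebDrivers\chromedriver.exe").read_bytes()
print("cdc_ occurrences:", data.count(b"cdc_"))   # expected 0 after patching
print("abc_ occurrences:", data.count(b"abc_"))   # the replacement string
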
  3. Using a Browser Extension

To bypass Selenium detection, employ uBlock Origin, a free browser extension adept at blocking unwanted content. Configure it to block JavaScript challenges and CAPTCHAs, mitigating bot detection risks. Install uBlock Origin, set it to block these challenges, and proceed with your Selenium interactions. This may not be universally effective, but in combination with other methods it proves valuable for evasion.

  4. Another interesting approach is to use Undetected_ChromeDriver

The Undetected_ChromeDriver project patches Selenium's ChromeDriver to emulate human behaviour, evading anti-bot services like Distil Networks, Imperva, DataDome, and Botprotect.io. It excels at handling the JavaScript challenges commonly used to block scraping and to mitigate cyberattacks and DDoS attacks.

By mimicking legitimate browser behaviour, Undetected_ChromeDriver reduces the risk of detection, making it a valuable tool for responsible web automation and scraping projects. It improves bot detection evasion over base Selenium; however, it may still struggle with robust protections like Cloudflare and Akamai.
Reference: Undetected_ChromeDriver
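
Getting started is close to a drop-in replacement for the standard driver. A minimal sketch, assuming the package is installed from PyPI as undetected-chromedriver and reusing the placeholder URL from the earlier snippets:

import undetected_chromedriver as uc

# uc.Chrome() fetches/patches a matching driver and strips common
# automation fingerprints before starting Chrome
driver = uc.Chrome()
driver.get("https://www.website.com")
print(driver.title)
driver.quit()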

  5. Cloudflare

Cloudflare employs various techniques to detect and mitigate bot activity on websites, using both active and passive bot detection. Here are some common Cloudflare detection techniques and potential ways to bypass them.

Active bot detection: CAPTCHAs, canvas fingerprinting, event tracking, environment API querying (browser-specific APIs, timestamp APIs), automated browser detection, sandboxing detection.
Passive bot detection: botnet detection, IP address reputation, HTTP request headers, TLS fingerprinting, HTTP/2 fingerprinting.

Common checks and possible workarounds:

    • JavaScript Challenges: Cloudflare serves JavaScript challenges to verify a functional JavaScript engine, a trait associated with genuine browsers. To bypass this, use headless browser automation tools like Puppeteer or Undetected_ChromeDriver for seamless interaction.

    • Browser Fingerprinting: Cloudflare analyzes browser fingerprints for bot detection. To bypass this, adjust crucial headers like User-Agent or use tools like Puppeteer to present authentic-looking fingerprints.

    • Rate Limiting and CAPTCHAs: Cloudflare uses rate limits and CAPTCHAs to prevent too many requests from one source. To bypass this, rotate IP addresses, use proxies or VPNs, and add delays to mimic human behaviour.

    • IP Reputation and Blacklisting: Cloudflare blocks flagged IPs linked to bot activity. To bypass this, use trusted proxies with low-risk profiles and rotate IPs to avoid reputation issues.

    • Machine Learning Algorithms: Difficult, but possible, to evade by varying your scraping methods so they do not follow predictable patterns.

    • Browser Verification Pages: To bypass these, use headless browser automation tools that can autonomously solve the challenges, mimicking human interaction.

In certain instances, the server may present JavaScript challenges that are particularly difficult to overcome, requiring a deep understanding and careful application of reverse-engineering techniques.

  6. User Behavior Analysis (UBA)

UBA involves monitoring and analyzing user behaviour data to distinguish between human users and bots. Anti-scraping systems rely on UBA to identify anomalies in behaviour and flag potential threats.

Bypassing these systems is difficult due to their adaptive nature. They continuously learn from user data, making solutions effective today and potentially ineffective in the future. Their reliance on AI and machine learning further reinforces their effectiveness in thwarting scraping attempts.
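
There is no single bypass here, but the practical counter is to make an automated session less uniform: irregular timing, gradual scrolling, realistic navigation. A minimal hedged Selenium sketch of that idea, where the delays and scroll amounts are arbitrary choices:

import random
import time

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.website.com")

for _ in range(5):
    # Scroll a random amount, like a human skimming the page
    driver.execute_script("window.scrollBy(0, arguments[0]);",
                          random.randint(300, 900))
    time.sleep(random.uniform(1.0, 4.0))  # irregular think-time between actions

driver.quit()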

We have learnt the what, why, and how. Now let's get down to business 💻

  1. Rotate IP proxies

An anti-scraping detection service is triggered when the same IP starts hitting the servers aggressively.

Perform your requests via a proxy server, keeping the requests themselves unmodified, so that the target server sees the proxy's IP rather than your original IP.

Build your own:

  1. Get a proxy list.

  2. Separate working proxies from the failed ones.

  3. Check for failures while scraping and remove any proxy that fails from the working list.

  4. Re-check non-working proxies from time to time and update the lists.

  5. Keep random intervals between requests.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.*;

public class ProxyManager {

    private static List<String> workingProxies = new ArrayList<>();
    private static List<String> failedProxies = new ArrayList<>();

    public static void main(String[] args) {
        List<String> proxyList = loadProxyList("proxylist.txt");

        for (String proxy : proxyList) {
            if (checkProxy(proxy)) {
                workingProxies.add(proxy);
            } else {
                failedProxies.add(proxy);
            }
            waitRandomInterval();
        }

        // Check failed proxies again after some time
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(ProxyManager::recheckFailedProxies, 0, 10, TimeUnit.MINUTES);
    }

    private static List<String> loadProxyList(String filePath) {
        List<String> proxyList = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
            String line;
            while ((line = br.readLine()) != null) {
                proxyList.add(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return proxyList;
    }

    private static boolean checkProxy(String proxyAddress) {
        try {
            URL url = new URL("https://httpbin.org/ip");
            Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyAddress, 8080));
            HttpURLConnection connection = (HttpURLConnection) url.openConnection(proxy);
            connection.setRequestMethod("GET");
            int responseCode = connection.getResponseCode();
            return (responseCode == HttpURLConnection.HTTP_OK);
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }

    private static void recheckFailedProxies() {
        List<String> tempFailedProxies = new ArrayList<>(failedProxies);
        for (String proxy : tempFailedProxies) {
            if (checkProxy(proxy)) {
                failedProxies.remove(proxy);
                workingProxies.add(proxy);
            }
            waitRandomInterval();
        }
    }

    private static void waitRandomInterval() {
        Random rand = new Random();
        int randomDelay = rand.nextInt(5000) + 1000; // Random delay between 1 to 5 seconds
        try {
            Thread.sleep(randomDelay);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
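
For the Python-minded, here is a hedged counterpart to the Java ProxyManager above: load a proxy list, keep the working ones, and space requests with random delays. The proxylist.txt filename and port 8080 mirror the Java example and are assumptions about your proxy source.

import random
import time

import requests

def load_proxies(path="proxylist.txt"):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def check_proxy(proxy):
    # Port 8080 mirrors the Java example above and is an assumption
    proxies = {"http": f"http://{proxy}:8080", "https": f"http://{proxy}:8080"}
    try:
        r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        return False

working, failed = [], []
for proxy in load_proxies():
    (working if check_proxy(proxy) else failed).append(proxy)
    time.sleep(random.uniform(1, 5))   # random interval between checks

# Pick a random working proxy per request; re-check failed ones periodically
if working:
    print("Using proxy:", random.choice(working))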

For production, managed providers such as Oxylabs are recommended, with datacenter and residential proxies listed in increasing order of cost.

  2. What Is a User Agent?

A User Agent (UA) is a string sent by the user's web browser to the web server in the HTTP headers to identify the browser type in use, its version, and the operating system. It can be accessed on the client side with JavaScript via the navigator.userAgent property, and the remote web server uses this information to identify the client and render content in a way that is compatible with the device and browser in use.

NOTE: Also remove the navigator.webdriver flag, which indicates whether the browser is controlled by automation tools such as Selenium (see the WebDriver flags section above).

Most web browsers tend to follow the same format:

Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36

Why Is a User Agent Important for Web Scraping?

Since UA strings help web servers identify the type of browser (and bots) requested, using a list of user agents combined with rotation can help mask your scraper as a web browser.

Exactly as with proxies, rotate through different UAs; unlike proxies, though, most UAs will work.

1. Get a list of UAs from https://explore.whatismybrowser.com/useragents/explore/; you can also check UAs at UserAgentString.com.
2. It is always best practice to keep random intervals between requests, for both proxies and UAs.
3. Always use up-to-date user agents/services.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class UserAgentExample {

    public static void main(String[] args) {
        List<String> userAgentList = new ArrayList<>();

        // Populate the user agent list with different user agents
        userAgentList.add("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36");
        userAgentList.add("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36");
        userAgentList.add("Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15");

        String url = "https://httpbin.org/headers";

        for (int i = 1; i <= 3; i++) {
            String userAgent = getRandomUserAgent(userAgentList);

            try {
                URL urlObj = new URL(url);
                HttpURLConnection connection = (HttpURLConnection) urlObj.openConnection();

                // Set the User-Agent header
                connection.setRequestProperty("User-Agent", userAgent);

                // httpbin.org/headers echoes the request headers in the JSON
                // response body, so read the body (there is no User-Agent
                // response header to read back)
                StringBuilder body = new StringBuilder();
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(connection.getInputStream()))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        body.append(line);
                    }
                }

                // Print the details
                System.out.println("Request #" + i);
                System.out.println("User-Agent Sent: " + userAgent);
                System.out.println("Response Body: " + body);
                System.out.println();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public static String getRandomUserAgent(List<String> userAgentList) {
        Random rand = new Random();
        return userAgentList.get(rand.nextInt(userAgentList.size()));
    }
}
import random
import requests

def get_random_user_agent(user_agent_list):
    return random.choice(user_agent_list)

def main():
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15"
    ]

    url = "https://httpbin.org/headers"

    for i in range(1, 4):
        user_agent = get_random_user_agent(user_agent_list)

        headers = {'User-Agent': user_agent}

        try:
            response = requests.get(url, headers=headers)
            received_ua = response.json()['headers']['User-Agent']

            print(f"Request #{i}")
            print(f"User-Agent Sent: {user_agent}")
            print(f"User-Agent Received: {received_ua}\n")

        except Exception as e:
            print(f"Error with request #{i}: {e}\n")

if __name__ == "__main__":
    main()

List of User Agents to avoid getting blocked:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15
Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15

  3. Accessing Restricted Content: Crawling Behind Login Walls

When a website employs a login wall, only authenticated users can access its content. This authentication is carried in HTTP headers, most often as cookies set in the browser after a successful login. To crawl such sites, your crawler needs access to these login cookies; you can extract their values from a request in DevTools after logging in, or use a headless browser to automate the login process.

Let's dig into the details - sessions and cookies

When a user first connects to a web application, the server will generate a unique session ID and send it to the client in a cookie. On subsequent requests, the client will send that session ID cookie, allowing the server to identify the user and retrieve any data it stored about that session.

The server typically stores session data in memory, a database, or a cache. This data can include things like Username, Preferences, Shopping cart, Access level, and information specific to that user's session

The session ID acts as a key to look at the corresponding session data on the server.

Cookies are one way to implement sessions. The server can set a cookie containing the session ID and send it to the client. On subsequent requests, the client will send that cookie, allowing the server to identify the session.

But cookies are not required for sessions. The session ID can also be passed in URL parameters or HTTP headers. Cookies are just a convenient way to persist the session ID client-side.

Performance

By default, HTTP connections close after each request-response. This requires a TCP handshake and SSL handshake for each request, which adds overhead.

Sessions allow the server to cache authentication and user-specific data, which can speed up request handling.

  1. The server does not need to re-authenticate the user on every request. It can simply look up the session to identify the user.

  2. Information stored in the session (like the shopping cart) can be accessed directly, without needing to query a database on every request.

Process

Sessions and cookies enable this by associating a "session ID" with the client; on subsequent requests, the client sends the session ID so the server can identify the user. Separately, when you make requests within a requests.Session() in Python, urllib3 reuses the underlying TCP connection by default, taking advantage of HTTP keep-alive. This connection reuse reduces the overhead of establishing new connections, DNS lookups, SSL handshakes, etc., so requests can be sent and answered more quickly.

Pros:

  • Reduced overhead: by reusing an existing TCP connection, there is less overhead from establishing new connections for each request, including the TCP handshake, DNS lookups, and SSL handshakes.

  • Faster requests: since the connection is already established, requests and responses happen more quickly, with no delay in setting up a new connection.

  • Caching: information stored in the session, like the user ID or authentication token, can be accessed directly without needing to re-authenticate for each request, which speeds up request handling.

Cons:

  • Dependency on the server: the server must support keep-alive and HTTP persistent connections; not all servers do this by default.

  • State management: managing session state and cookies adds complexity to your scraping code; you have to handle storing, updating, and expiring the session/cookies.

  • Errors: if an error occurs in the middle of a keep-alive connection, you may have to re-establish the connection, which can cause issues.

  • Security risks: sessions and cookies introduce security risks if not implemented properly; they could expose your authentication tokens or session IDs.

  • Resource exhaustion: keeping too many connections alive could exhaust the server's resources if not rate-limited.

Better way:

Managing connection reuse at the library level can be a simpler and more secure alternative to using sessions and cookies directly in your scraping code.
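
A minimal sketch of what "managing it at the library level" can look like with requests: mount an HTTPAdapter on a Session so pooling and retries are handled for you. The pool sizes and retry policy here are illustrative assumptions.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=retries)
session.mount("https://", adapter)
session.mount("http://", adapter)

# The session now reuses pooled connections and retries transient failures
response = session.get("https://scrapethissite.com/pages/forms/?page_num=1", timeout=10)
print(response.status_code)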

Session

import requests
from bs4 import BeautifulSoup
import datetime

s = requests.Session()

def get_title(x):
    url = f'https://scrapethissite.com/pages/forms/?page_num={x}'
    r = requests.get(url)
    sp = BeautifulSoup(r.text, 'html.parser')
    print(sp.title.text+" "+str(x))
    return

def get_title_Session(x):
    url = f'https://scrapethissite.com/pages/forms/?page_num={x}'
    r = s.get(url)
    sp = BeautifulSoup(r.text, 'html.parser')
    print(sp.title.text+" "+str(x))
    return

# Create a start timestamp
start = datetime.datetime.now()
# Loop through pages
for x in range(1, 21):
    get_title_Session(x)

# Calculate the time taken
finish = datetime.datetime.now() - start
print(finish)

# with session 0:00:24.839232
# without session 0:00:37.804953

Cookies

import pickle
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Define constants
URL = "https://www.instagram.com/"
USERNAME = "EnterValidUsername"
PASSWORD = "EnterValidPassword"
COOKIES_FILE = "cookies.pkl"
WAIT_TIME = 10  # Adjust as needed

# Initialize the webdriver with options
options = webdriver.ChromeOptions() 
options.add_argument('--disable-blink-features=AutomationControlled')
driver = webdriver.Chrome(options=options)

def login():
    try:
        with open(COOKIES_FILE, 'rb') as cookies_file:
            cookies = pickle.load(cookies_file)

        # Visit the target website
        driver.get(URL)
        time.sleep(10)  # Adjust sleep time as needed

        # Add the cookies to the browser
        for cookie in cookies:
            driver.add_cookie(cookie)

        # Refresh the page (or navigate to the target page)
        driver.refresh()
        print("Page reloaded with cookies")

        wait = WebDriverWait(driver, 10)
        # Dismiss the 'Not Now' notification prompt if it appears
        # (EC.presence_of_element_located returns a callable, so using it in an
        # "if" is always truthy; wrap the wait in try/except instead)
        try:
            offNotification = wait.until(
                EC.presence_of_element_located((By.XPATH, "//button[text()='Not Now']")))
            offNotification.click()
            time.sleep(2)
        except Exception:
            pass  # prompt did not appear within the wait time
        # You are now logged in

    except Exception as e:
        print(f"Error during login using cookies: {e}")
        print("Getting new cookies via logging in")
        getCook()

    finally:
        driver.quit()

def getCook():
    try:
        driver.get(URL)

        wait = WebDriverWait(driver, 10) # Wait for up to 10 seconds
        username = wait.until(EC.presence_of_element_located((By.XPATH,"//input[@name='username']")))
        password = driver.find_element(By.CSS_SELECTOR,"input[name='password']")
        username.clear()
        password.clear()

        username.send_keys(USERNAME)
        password.send_keys(PASSWORD)

        # Wait for the button to become clickable
        wait.until(EC.element_to_be_clickable((By.XPATH, "//button[@type='submit']"))).click()

        offNotification = wait.until(EC.element_to_be_clickable((By.XPATH,"//button[text()='Not Now']")))
        offNotification.click()
        time.sleep(2)

        # Get the cookies
        cookies = driver.get_cookies()

        # Save the cookies to a file using pickle
        with open(COOKIES_FILE, 'wb') as cookies_file:
             pickle.dump(cookies, cookies_file)

    except Exception as e:
        print(f"Error during login: {e}")

    finally:
        driver.quit()

# Example usage:
try: 
    login()
except Exception as e:
    print(f"An error occurred: {e}")

Understanding the lifecycle of a website is fundamental in web scraping. Each website presents its unique challenges, and addressing them requires a tailored approach. Let's delve into an innovative solution for handling the infinite scroll issue on the JioMart website and extracting information on beverages.

Challenge: Infinite Scroll

JioMart's extensive range of products poses a challenge due to its infinite scroll feature. Traditional methods like Selenium may struggle with loading such large quantities of data.

Solution: Categorization and Targeted Scraping

  1. Initial Page Load: Begin by navigating to the beverages section. This will be your starting point.

  2. Category Selection: Instead of scraping all 20K products at once, categorize them by brand or type. For instance, focus on a specific brand, such as Coca-Cola.

  3. Sub-Categorization: Within the selected brand, further narrow down the products by subcategories. This could be alphabetical or any other relevant criteria.

  4. API Calls: Utilize the website's APIs, if available, to fetch data efficiently. APIs can often provide structured data in JSON format, bypassing the need for extensive DOM traversal.

  5. Dynamic Loading Techniques: If the website heavily relies on JavaScript for data loading, consider employing dynamic loading techniques. This involves directly interacting with the JavaScript code to trigger data-loading events, as shown in the sketch below.
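
A minimal hedged sketch of the dynamic-loading piece with Selenium: scroll until the page height stops growing, then collect whatever has been rendered. The category URL and the product-card selector are placeholders, not JioMart's real markup.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.example.com/category/beverages")  # hypothetical category URL

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)                      # give the page time to load the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:      # no new content appeared, so we are done
        break
    last_height = new_height

# "div.product" is an assumed selector for the loaded product cards
print(len(driver.find_elements(By.CSS_SELECTOR, "div.product")), "products loaded")
driver.quit()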

Honeypots in Web Scraping: Challenges & Solutions

  1. Data Integrity Issues

    • Challenge: Honeypots may provide misleading data, affecting analysis accuracy.

    • Solution: Verify website authenticity before scraping for reliable results.

  2. IP Blocking

    • Challenge: Honeypots can lead to IP blocking, limiting access.

    • Solution: Use rotating proxies to mask your IP address and prevent blocking.

  3. Legal Implications

    • Challenge: Accidental interaction with honeypots can raise legal concerns.

    • Solution: Adhere to ethical scraping practices and respect robots.txt guidelines.

  4. Workflow Disruption

    • Challenge: Honeypots can disrupt scraping, causing inefficiencies.

    • Solution: Implement request spacing and avoid hidden links for smoother scraping (see the sketch after this list).

  5. Network Resource Misuse

    • Challenge: Honeypots may lead to resource wastage on your infrastructure.

    • Solution: Monitor network traffic, optimize scraper behavior, and address anomalies.

  6. Headless Browser Detection

    • Challenge: Some honeypots detect headless browsers commonly used for scraping.

    • Solution: Implement stealth measures, like customizing user agents, to mimic human interaction.
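
Tying a couple of these together, here is a minimal hedged sketch of skipping hidden (honeypot) links before crawling; the style checks are simple heuristics that assume inline styling, since stylesheets and computed styles can also hide elements.

from bs4 import BeautifulSoup

def visible_links(html):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            continue            # likely a honeypot: a human would never see it
        if a.get("hidden") is not None or a.get("aria-hidden") == "true":
            continue
        links.append(a["href"])
    return links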

Quick Note

  1. Always Check for the Hidden API.

  2. Understand the life cycle of the website.

  3. Sometimes valuable information is embedded directly in the page's JavaScript (dynamic content loading), so don't miss anything!

  4. Have fun scraping! 🦉

The End

In the ever-evolving landscape of web scraping, the quest for knowledge knows no bounds. As we navigate through challenges like honeypots, IP blocking, and legal considerations, we emerge stronger and more ingenious. Remember, innovation is not just a solution; it's a mindset. So, arm yourself with the latest techniques, stay ethically sound, and let the digital realm unveil its secrets. Happy scraping, fellow explorers! 🚀