Open a World of Opportunities: Web Scraping Using PHP and Python

The latest estimates say the total number of websites has crossed the one billion mark; sites are added and removed every day, but the record stands.

Just imagine how much data is floating around the web. The amount is so huge that even hundreds of people could not digest all of it in a lifetime. To tackle such large amounts of data, you need not only easy access to the information but also a scalable way to gather it so that you can organize and analyze it. That is exactly where web data scraping comes into the picture.

Web scraping, data mining, web data extraction, web harvesting and screen scraping are terms often used for the same thing: a technique in which a computer program fetches large volumes of data from a website and saves them on your computer, in a spreadsheet or in a database, in a structured format for easy analysis.

Web Scraping with Python and BeautifulSoup

If the ready-made scraping tools available online do not meet your needs, you can build your own data scraper, which is quite easy. In this post we show how to build a web scraper with Python and the simple yet versatile BeautifulSoup library:

First, import the two libraries we will use, requests and BeautifulSoup:

# Import libraries
import requests
from bs4 import BeautifulSoup

Second, store the URL in a variable, pass it to the requests.get method, and access the HTML content of the page:

import requests
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
print(r.content)
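A request can fail or return an error page, so it may help to wrap the fetch step. The following is a minimal sketch, not part of the original tutorial: the helper name fetch_html and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper name; the timeout is an assumed default.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as error:
        # Covers connection errors, timeouts, bad URLs and HTTP error codes
        print(f"Request failed: {error}")
        return None
    return response.text


if __name__ == "__main__":
    html = fetch_html("http://www.values.com/inspirational-quotes")
    if html is not None:
        print(html[:200])  # show only the start of the page
```

Returning None on failure keeps the rest of the script simple: later steps can skip parsing when there is nothing to parse.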

Next, to parse the webpage, we create a BeautifulSoup object:

import requests 
from bs4 import BeautifulSoup
URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)

# Create a BeautifulSoup object (the 'html5lib' parser requires the html5lib package)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

Now, let’s extract some meaningful information from the HTML content. Look at the HTML of the webpage, which was printed using the soup.prettify() method. We start by locating the div whose id is container:

table = soup.find('div', attrs = {'id':'container'})

Here, each quote sits inside its own div container belonging to the class quote.

We iterate over every div of class quote using the findAll() method (named find_all() in current BeautifulSoup releases), binding each one to the variable row.

For each row, we create a dictionary holding the data about that quote and append it to a list called quotes.

    quote['lines'] = row.h6.text
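The loop described above can be sketched as follows. This is an illustration under assumptions, not the original code: the HTML sample is hypothetical and stands in for the real page, and only the h6 lookup for 'lines' comes from the article. The h5 lookup for 'theme' is a guess at the page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the real page; the actual markup may differ.
SAMPLE_HTML = """
<div id="container">
  <div class="quote">
    <h5>Courage</h5>
    <h6>Fortune favors the bold.</h6>
  </div>
  <div class="quote">
    <h5>Patience</h5>
    <h6>Good things take time.</h6>
  </div>
</div>
"""

# html.parser ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

quotes = []  # one dictionary per quote
table = soup.find("div", attrs={"id": "container"})
for row in table.find_all("div", attrs={"class": "quote"}):
    quote = {}
    quote["theme"] = row.h5.text.strip()  # assumed location of the theme
    quote["lines"] = row.h6.text.strip()  # same lookup as in the article
    quotes.append(quote)

print(quotes)
```

Working against a small inline sample like this makes it easy to check the selectors before pointing the scraper at a live page.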

The final step is to write the data to a CSV file:

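import csv

filename = 'inspirational_quotes.csv'
# In Python 3 the csv module expects a text-mode file opened with newline=''
with open(filename, 'w', newline='', encoding='utf-8') as f:
    w = csv.DictWriter(f, ['theme', 'url', 'img', 'lines', 'author'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)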
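The CSV step can be checked in isolation with a short round trip. The sketch below is an assumption-laden illustration: the rows are placeholder values, not scraped data, and an in-memory buffer stands in for the file. Note that in Python 3 the csv module works with text, not bytes.

```python
import csv
import io

# Hypothetical rows shaped like the scraped quotes; all values are placeholders.
quotes = [
    {"theme": "Courage", "url": "/quote/1", "img": "/img/1.jpg",
     "lines": "Slow down, then begin.", "author": "Unknown"},
    {"theme": "Patience", "url": "/quote/2", "img": "/img/2.jpg",
     "lines": "Good things take time.", "author": "Unknown"},
]
fieldnames = ["theme", "url", "img", "lines", "author"]

# In-memory text buffer used in place of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(quotes)

# Read the CSV text back and compare it with what was written
buffer.seek(0)
rows_read = list(csv.DictReader(buffer))
print(rows_read == quotes)
```

The first row contains a comma inside the 'lines' field, which shows that the writer quotes such values and the reader restores them intact.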

This type of web scraping works at a small scale; for larger-scale work, you can consider:

Scraping Websites with PHP and cURL

To connect to a wide range of servers and protocols, and to download pictures, videos and graphics from several websites, consider scraping with PHP and cURL:

<?php

function curl_download($Url){

    if (!function_exists('curl_init')){
        die('cURL is not installed. Install and try again.');
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    curl_close($ch);

    return $output;
}

print curl_download('http://www.gutenberg.org/browse/scores/top');

?>

In a nutshell, the scope for using web scraping to analyze content and inform your content marketing strategy is vast. Backed by many kinds of data analysis, web scraping has proved to be a valuable tool for content producers. So, when are you going to start using it?

 
This post originally appeared on dzone.com/articles/be-leading-content-provider-using-web-scraping-php
 

Published September 7, 2017, 7:32 am
