Web scraping: implementation using Python

Web scraping is used to collect large amounts of information from websites.

Web scraping is an automated method for extracting large amounts of data from websites. Most data on web pages is unstructured: it is embedded in HTML markup that was written for display, not for analysis. Web scraping collects this data and stores it in a structured form, such as a CSV file or a database table.

Steps of web scraping:

· Inspect the page
· Find the data you want to extract
· Write the code
· Run the code and extract the data
· Store the data in the required format
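
The extract and store steps above can be sketched on a small inline HTML string, so the example runs without a network connection. The markup, the class names (quote, text, author) and the helper names (QuoteParser, extract_quotes, to_csv) are all hypothetical and chosen only for illustration; a real page has to be inspected first to find its own tags and classes. To stay dependency-free, the sketch uses the standard library's html.parser instead of BeautifulSoup.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup standing in for a downloaded page; real sites differ.
SAMPLE_HTML = """
<div class="quote"><p class="text">Stay curious.</p><span class="author">A. Writer</span></div>
<div class="quote"><p class="text">Keep going.</p><span class="author">B. Reader</span></div>
"""


class QuoteParser(HTMLParser):
    """Collects text/author pairs from the hypothetical markup above."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per quote found so far
        self._field = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "quote":
            # A new quote container starts a new record.
            self.rows.append({"text": "", "author": ""})
        elif css_class in ("text", "author") and self.rows:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a text/author element.
        if self._field:
            self.rows[-1][self._field] += data.strip()

    def handle_endtag(self, tag):
        self._field = None


def extract_quotes(html):
    """Step: find and extract the data."""
    parser = QuoteParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Step: store the data in a structured format (CSV text here)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "author"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_quotes(SAMPLE_HTML)
print(to_csv(rows))
```

In a real scraper, the string passed to extract_quotes would be the downloaded page body, and the CSV text would be written to a file rather than printed.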

Implementation code:

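import requests
from bs4 import BeautifulSoup

URL = "http://www.values.com/inspirational-quotes"
# Download the page and stop on HTTP errors rather than parsing an error page.
r = requests.get(URL, timeout=10)
r.raise_for_status()
# Parse the HTML; the html5lib parser must be installed separately (pip install html5lib).
soup = BeautifulSoup(r.content, "html5lib")
# Print the parsed document with consistent indentation.
print(soup.prettify())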

Sample output of web scraping (truncated):

<!DOCTYPE html>
<html class="no-js" dir="ltr" lang="en-US">
 <head>
  <title>
   Inspirational Quotes - Motivational Quotes - Leadership Quotes | PassItOn.com
  </title>
  <meta charset="utf-8"/>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width,initial-scale=1.0,maximum-scale=1" name="viewport"/>
  <meta content="The Foundation for a Better Life | Pass It On.com" name="description"/>
  <link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
  <link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
  <link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
  <link href="/site.webmanifest" rel="manifest"/>
  <link color="#c8102e" href="/safari-pinned-tab.svg" rel="mask-icon"/>
  <meta content="#c8102e" name="msapplication-TileColor"/>
  <meta content="#ffffff" name="theme-color"/>
  <link crossorigin="anonymous" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" rel="stylesheet"/>
  <link href="/assets/application-37a5d7aa5ae64e3dfae4be1c91df4127.css" media="all" rel="stylesheet"/>
  <meta content="authenticity_token" name="csrf-param"/>
  <meta content="NQkw46UV67kc3o09ILquz+ck+/C45DiHdOl1w33U7HnFLgkr1L3VBl3mIZpSVkzg6AgTpzCGW5D81zU5UovZPg==" name="csrf-token"/>
  <!-- Global site tag (gtag.js) - Google Analytics -->
  <script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-1179606-29">
  </script>
  <script>
   window.dataLayer = window.dataLayer || [];
          function gtag(){dataLayer.push(arguments);}
          gtag('js', new Date());
          gtag('config', 'UA-1179606-29');
  </script>
  <script>
   window.fbAsyncInit = function() {
            FB.init({
              appId            : '483774921971842',
              autoLogAppEvents : true,
              xfbml            : true,
              version          : 'v6.0'
            });
          };
  </script>
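
Printing the whole page is rarely the end goal; usually only a few values are needed. As a rough sketch, the title and the named meta tags can be pulled out of a head section like the one in the sample output. The HTML below is a trimmed copy of that output, and HeadParser and summarize_head are hypothetical names introduced for this example. It again uses only the standard library's html.parser so it runs without extra installs.

```python
from html.parser import HTMLParser

# A trimmed copy of the <head> from the sample output.
HEAD_HTML = """
<html lang="en-US">
 <head>
  <title>
   Inspirational Quotes - Motivational Quotes - Leadership Quotes | PassItOn.com
  </title>
  <meta charset="utf-8"/>
  <meta content="The Foundation for a Better Life | Pass It On.com" name="description"/>
  <meta content="#ffffff" name="theme-color"/>
 </head>
</html>
"""


class HeadParser(HTMLParser):
    """Records the page title and every <meta> tag that has a name attribute."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            # Map the meta name to its content, e.g. "description" -> "...".
            self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize_head(html):
    """Return the title and named meta values as one flat dictionary."""
    parser = HeadParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), **parser.meta}


summary = summarize_head(HEAD_HTML)
print(summary["title"])
print(summary["description"])
```

The resulting dictionary is already structured data, so it can be written to CSV or JSON without further cleanup.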
 
