Web scraping, sometimes loosely called parsing, refers to extracting the required data from websites. The rvest library in R provides this functionality.
We can parse a webpage with R in the following three steps:
1. Import the rvest library.
2. Read the HTML code.
3. Scrape the required data from the HTML code.
Here is R code that scrapes data from a Wikipedia page.
library(rvest)

# Read the HTML
webpage = read_html("https://en.wikipedia.org/wiki/Web_scraping")

# Scrape data with CSS selector
data = html_node(webpage, '.mw-page-title-main')

# Convert the data to text
text = html_text(data)
print(text)
Line 1: We import the rvest library.
Line 4: We use the read_html() function to download and parse the HTML from the Wikipedia URL provided as a parameter.
Line 7: We scrape the page's title from the HTML code stored in webpage. In this case, the CSS selector for the title is .mw-page-title-main.
Line 10: We convert the value stored in data to readable form, i.e., text.
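Try changing the CSS selector at line 7 to 'p'. Because html_node() returns only the first match, this scrapes the first paragraph. To scrape all the paragraph sections, also change html_node() to html_nodes().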
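The difference between html_node() and html_nodes() is easier to see on a small page. The following is a minimal sketch that assumes a made-up inline HTML string (not the Wikipedia page), so it runs without a network connection. The names page, first_p, and all_p are illustrative only.

```r
library(rvest)

# Hypothetical inline HTML, used so the example runs without a network call
page = read_html("<html><body><h1 class='title'>Demo</h1><p>First</p><p>Second</p></body></html>")

# html_node() returns only the first element matching the selector
first_p = html_text(html_node(page, "p"))

# html_nodes() returns every element matching the selector
all_p = html_text(html_nodes(page, "p"))

print(first_p)
print(all_p)
```

In rvest 1.0 and later, the same functions are also available under the names html_element() and html_elements().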
Note: If a CSS selector used here no longer works, inspect the element in the browser's developer tools and verify the selector.
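A selector that no longer matches can also be detected in code. The following is a rough sketch on a made-up inline HTML string. The class name .no-such-class and the variable names are hypothetical, chosen only to show what a failed match looks like.

```r
library(rvest)

# Hypothetical inline HTML standing in for a downloaded page
page = read_html("<html><body><h1 class='title'>Demo</h1></body></html>")

# When nothing matches, html_node() yields a missing node,
# and html_text() converts it to NA
missing_text = html_text(html_node(page, ".no-such-class"))

# html_nodes() yields an empty set when nothing matches
n_matches = length(html_nodes(page, ".no-such-class"))

if (is.na(missing_text)) {
  message("Selector matched nothing; inspect the element and update the selector.")
}
```

Checking for NA or for a zero-length result before using the scraped value makes a changed page layout visible immediately instead of silently producing empty data.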