If you would like to learn how to extract data from public web pages and have a basic understanding of Java, CSS, and HTML, then this tutorial is for you. You will inspect the HTML structure of your target site with your browser's developer tools. I wrote this article FOR EDUCATIONAL PURPOSES ONLY, as I will work on the IMDb website.
Before we move to the real web scraping section, I will show how to extract data from a simple DOM. For example, we have a very simple element structure as below:
HTML
<div id="block">
<h1>My First Heading</h1>
<p>My first paragraph.</p>
<p>My second paragraph.</p>
<img src="http://example.com/example.jpg"/>
</div>
To get the text inside a p tag, we have two options for the CSS selector. We can start from the root id or select the p tag directly.
Java
// Select p directly; first() returns the first of the two p elements
Element p1 = doc.select("p").first();
String tx1 = p1.text();
// More specific: start from the root element with id "block", then find p inside it
Element p2 = doc.select("#block p").first();
String tx2 = p2.text();
// Print both results (both selectors match the same first paragraph)
System.out.println(tx1);
System.out.println(tx2);
doc is the variable that stores the Document of the HTML DOM; just ignore it for a moment.
select() is the method that finds the elements matching a CSS selector and returns them as Elements.
first() returns the first element found as a single Element.
text() extracts the text from the element.
The Java code prints My first paragraph. twice, because both selectors match the same elements and first() returns only the first one. If you want the text of the second p, replace first() with get(1): select() returns an Elements collection, so we can access its items by index.
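To make doc concrete without a network connection, the sketch below builds it from the sample HTML with Jsoup.parse(), which accepts an HTML string. The class name SimpleDomDemo and its helper methods are assumptions made for illustration; select(), first(), get() and text() are the same calls described above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SimpleDomDemo {
    // The sample markup from the start of this section.
    static final String HTML =
        "<div id=\"block\">"
        + "<h1>My First Heading</h1>"
        + "<p>My first paragraph.</p>"
        + "<p>My second paragraph.</p>"
        + "<img src=\"http://example.com/example.jpg\"/>"
        + "</div>";

    // Parse the HTML string into a Document (this plays the role of "doc").
    static Document parse() {
        return Jsoup.parse(HTML);
    }

    // Text of the p element at the given zero-based index inside #block.
    static String paragraphText(int index) {
        Elements paragraphs = parse().select("#block p");
        return paragraphs.get(index).text();
    }

    public static void main(String[] args) {
        Document doc = parse();
        System.out.println(doc.select("p").first().text()); // text of the first p
        System.out.println(paragraphText(1));                // text of the second p
        System.out.println(doc.select("#block p").size());   // how many p elements matched
    }
}
```

Calling get() with an index that does not exist throws an exception, so checking size() first is a reasonable habit when the page structure is not under your control.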
Next, we will get the image source from the img element. As before, we can start from the root id or select the img tag directly.
Java
// Select img directly
Element img1 = doc.select("img").first();
String src1 = img1.attr("src");
// More specific: start from the root element with id "block", then find img inside it
Element img2 = doc.select("#block img").first();
String src2 = img2.attr("src");
// Print both results
System.out.println(src1);
System.out.println(src2);
Everything is almost the same, but you will notice one new method. attr() extracts the value of an attribute from an HTML element. The img element has a src attribute, so we pass src to the method.
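As a small sketch of how attr() behaves, the example below reads one attribute that exists (src) and one that does not (alt) from the sample img element. The class name AttrDemo and the helper imageAttr() are hypothetical names used only for this illustration. In jsoup, attr() returns an empty string when the attribute is missing, and first() returns null when nothing matches, so the helper guards against both.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AttrDemo {
    // Only the img part of the sample markup is needed here.
    static final String HTML =
        "<div id=\"block\"><img src=\"http://example.com/example.jpg\"/></div>";

    // Value of the named attribute on the first img inside #block,
    // or an empty string if there is no img or no such attribute.
    static String imageAttr(String name) {
        Document doc = Jsoup.parse(HTML);
        Element img = doc.select("#block img").first();
        return img == null ? "" : img.attr(name);
    }

    public static void main(String[] args) {
        System.out.println(imageAttr("src"));           // the image URL
        System.out.println(imageAttr("alt").isEmpty()); // whether alt is missing
    }
}
```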
I hope this section gave you some idea of how selectors work, so let's get started scraping a real website.
Step 1: Create a new Java project in Eclipse (File -> New -> Java Project -> name the project -> Finish)
Step 2: Create a new class for running the code. Right-click on src -> New -> Class -> name the class -> tick public static void main(String[] args) -> Finish
Step 3: Add the Jsoup library to the Java project. Copy the Jsoup jar file and paste it into the src folder. Then right-click on the jar file -> Build Path -> Add to Build Path
Step 4: Connect to the IMDb top chart page and load it into a Document.
Document doc = Jsoup.connect("https://www.imdb.com/chart/top").timeout(6000).get();
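The connection line can be wrapped in a small helper with error handling, since get() throws an IOException when the request fails. This is only a sketch: the class name FetchDemo, the methods prepare() and fetch(), and the user agent string are assumptions for illustration, and the article does not say whether the site requires a particular user agent. The timeout value is in milliseconds and matches the one used above.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchDemo {
    // Build the request without sending it yet.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (tutorial example)") // assumed value
                .timeout(6000);                              // milliseconds
    }

    // Send the request and parse the response into a Document.
    static Document fetch(String url) throws IOException {
        return prepare(url).get();
    }

    public static void main(String[] args) {
        try {
            Document doc = fetch("https://www.imdb.com/chart/top");
            System.out.println(doc.title());
        } catch (IOException e) {
            // Network errors, timeouts and HTTP error statuses end up here.
            System.err.println("Could not load the page: " + e.getMessage());
        }
    }
}
```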
Step 5: We will get the movie titles by extracting data from the HTML. Before we do that, we have to inspect the elements in the browser (Ctrl+Shift+I in Chrome). Then we will check the number of movies in the list with the CSS selector below.
Java
Elements body = doc.select("tbody.lister-list");
System.out.println(body.select("tr").size());
The selector tbody.lister-list finds the table body that has the class lister-list, and we print the number of tr elements inside it. Selecting tr returns a collection of elements, and collections and loops go hand in hand.
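The same count can be tried offline on a small table. The markup below is hypothetical: only the tag and class names (tbody.lister-list, td.posterColumn) come from this tutorial, and the live page may be structured differently by the time you read this. The class name RowCountDemo and the method countRows() are likewise assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class RowCountDemo {
    // Hypothetical two-row table that mimics the structure described in the text.
    static final String HTML =
        "<table><tbody class=\"lister-list\">"
        + "<tr><td class=\"posterColumn\"><img src=\"a.jpg\" alt=\"Movie A\"/></td></tr>"
        + "<tr><td class=\"posterColumn\"><img src=\"b.jpg\" alt=\"Movie B\"/></td></tr>"
        + "</tbody></table>";

    // Number of tr elements inside tbody.lister-list.
    static int countRows(String html) {
        Document doc = Jsoup.parse(html);
        Elements body = doc.select("tbody.lister-list");
        return body.select("tr").size();
    }

    public static void main(String[] args) {
        System.out.println(countRows(HTML));
        // select() returns an empty collection, not null, when nothing matches.
        System.out.println(countRows("<p>no table here</p>"));
    }
}
```

If the count is 0 on the real page, the selector no longer matches the markup, and it is worth inspecting the page again before going on to the loop.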
Step 6: We will loop over the tr elements and select the src attribute of each image.
for (Element e : body.select("tr"))
{
    String img = e.select("td.posterColumn img").attr("src");
    System.out.println(img);
}
We will see the thumbnail source of every movie in the list. Let's analyze one element of the for loop. The purpose of step 6 is to get the image link, so we start from a tr element, which contains a td element that we can identify by its class name posterColumn. Inside that td there is an image element, so we add the img tag, which gives us the selector td.posterColumn img. Next, we call attr() to read the attribute value, which is in src.
Step 7: We will get the movie title. The title is available in the alt attribute of the image or in the a tag, so we have options; in this case I will get the title from the alt attribute of the image. We can write the code as below:
for (Element e : body.select("tr"))
{
    String title = e.select("td.posterColumn img").attr("alt");
    System.out.println(title);
}
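The two loops can also be combined so that each row yields its title and its thumbnail together. The sketch below runs on hypothetical sample markup that reuses the class names from this tutorial; the class name PosterListDemo, the method titlesAndPosters() and the "title | src" output format are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class PosterListDemo {
    // Hypothetical two-row table that mimics the structure described in the text.
    static final String HTML =
        "<table><tbody class=\"lister-list\">"
        + "<tr><td class=\"posterColumn\"><img src=\"a.jpg\" alt=\"Movie A\"/></td></tr>"
        + "<tr><td class=\"posterColumn\"><img src=\"b.jpg\" alt=\"Movie B\"/></td></tr>"
        + "</tbody></table>";

    // One "title | src" string per table row.
    static List<String> titlesAndPosters(String html) {
        List<String> rows = new ArrayList<>();
        Document doc = Jsoup.parse(html);
        for (Element e : doc.select("tbody.lister-list tr")) {
            // Select the image once and read both attributes from it.
            Elements img = e.select("td.posterColumn img");
            String title = img.attr("alt");
            String src = img.attr("src");
            rows.add(title + " | " + src);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : titlesAndPosters(HTML)) {
            System.out.println(row);
        }
    }
}
```

Selecting the image once per row avoids running the same selector twice and keeps the title and the thumbnail of one movie together.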
In conclusion, the main point of scraping in this section is CSS selectors. If you already understand basic CSS, HTML, and Java, you are ready to go.
Founder of CamboTutorial.com. I am happy to share programming knowledge that can help other people. I love writing tutorials about PHP, Laravel, Python, Java, and Android development; all published posts are kept simple and easy for beginners to understand.