Parse HTML String

alihallo

I am currently trying to create my first little module and could use a hint from an experienced person.
In my node_helper.js I request the HTML code of a simple website using:

var options = {url: URL};
request(options, (error, response, body) => {
    // response is undefined if the request itself failed, so check error first
    if (!error && response.statusCode === 200) {
        this.sendSocketNotification("DATA", this.parseData(body));
    }
});

So the variable body contains something like:

<head>
...
</head>
<body><a name="top"></a>
   <div id="data_1">
      <p>The relevant data 1</p>
   </div>
   <div id="data_2">
      <p>The relevant data 2</p>
   </div>
...

My question is: what would you recommend for getting the relevant data from this body into a local variable?
I first thought of using JavaScript and regex, but I guess that is not a good way to solve the issue, is it?

Best regards,
alihallo

cowboysdude

I could be wrong but I believe you could request it using NPM request…

alihallo

Thank you for your answer, but I couldn’t find out how to use the NPM request module to parse the HTML code.
However, I found another way to solve the issue:

https://github.com/cheeriojs/cheerio

With it I can get the data out of the HTML code like this:

var options = {url: URL};
request(options, (error, response, body) => {
    if (!error && response.statusCode === 200) {
        this.sendSocketNotification("DATA", this.parseData(body));


...


var $ = cheerio.load(body, {
   normalizeWhitespace: true,
   xmlMode: false
});
		
$('div[class=data_1]').find('p').each(function (index, element) {
	data_array.push($(element).text());
});

This way I could solve my problem.

Best regards,
alihallo

strawberry 3.141

I think it’s weird that this works, because you’re looking for the attribute class = data_1, but it’s an id.

The CSS selector for an id is #, and when you put p after it, it will look for paragraphs inside the element with the id data_1.

when you replace

$('div[class=data_1]').find('p').each(function (index, element) {
	data_array.push($(element).text());
});

with

data_array.push($('#data_1 p').text());

does it still work? Not sure whether it will return the element if just one occurrence is found or will return an array anyway

ianperrin

@alihallo

If your input HTML is fairly simple and you parse it in the browser part of the module, you may be able to avoid the use of the cheerio library entirely:

// an array to hold the data from the file
var data_array = [];
// body is a string, so parse it into a document first
// (DOMParser exists in the browser, not in node_helper.js)
var doc = new DOMParser().parseFromString(body, 'text/html');
// Get all p tag elements inside div tag elements with an id that starts with 'data_'
var data_tags = doc.querySelectorAll('div[id^=data_] p');
// Loop through data tags and add content to data array
for (var i = 0; i < data_tags.length; i++) {
    data_array.push(data_tags[i].innerHTML);
}

Of course the more complex your input file is the more you might benefit from the use of cheerio.

Plati

I want to create a module that gets data from a website by element id.

Example:

Website code:

<b class="b2 nieb" title="Kurs EUR na żywo" id="EURPLN">4.33320</b>

And I want to display the value of the element with id="EURPLN".

I want to set in the config which site to collect the data from and which id to look for.

Example:

defaults: {
		url: "http://domain.com/",
		findID: "EURPLN"
}

How can I do this?

Note from admin: Please use Markdown on code snippets!

alihallo

@strawberry-3-141 You are absolutely right, I made a bad example. The real HTML code is a bit more complicated, and I mixed the real code up with this example.
@ianperrin My input html is more complicated, but thanks for your answer, good to know!

@Plati It helped me a lot to look at other modules. You should use a node_helper.js; there you can create a function that fetches the HTML code of the website.

//function which gets the data from the given URL
getTheData: function(theURLtoCatch) {
   var options = {url: theURLtoCatch};
   request(options, (error, response, body) => {
      if (!error && response.statusCode === 200) {
         this.sendSocketNotification("DATA", this.parseHTML(body));
      } else {
         //response is undefined when the request itself failed, so fall back to the error message
         var reason = error ? error.message : response.statusCode;
         console.log("Error getting Data " + reason);
         this.sendSocketNotification("ERROR", reason);
      }
   });
},

parseHTML: function(dataBody) {
   //use something like ianperrin and strawberry showed in their examples

}
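
For a very simple case like Plati's single tag, the body of parseHTML might not even need a parser. Below is a minimal, dependency-free sketch that pulls the text out of the first element with a given id using a regular expression. The names extractTextById and sample are hypothetical and not from any module in this thread. It only works when the element contains plain text and no nested tags, so for anything more complicated cheerio is probably the safer choice.

```javascript
// Hypothetical helper (an assumption, not from the thread): returns the text
// inside the first element whose id attribute equals `id`, or null if there
// is no such element. Elements with nested tags are deliberately not matched.
function extractTextById(html, id) {
  // Escape characters that have a special meaning in a regular expression.
  var safeId = id.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // <tag ... id="ID" ...>plain text</tag>
  var pattern = new RegExp(
    "<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*\\sid=[\"']" + safeId +
    "[\"'][^>]*>([^<]*)</\\1>"
  );
  var match = pattern.exec(html);
  return match ? match[2].trim() : null;
}

// The tag from Plati's post, used as sample input.
var sample = '<b class="b2 nieb" title="Kurs EUR na żywo" id="EURPLN">4.33320</b>';
console.log(extractTextById(sample, "EURPLN"));
```

Inside parseHTML this could be called with dataBody and the configured id, assuming the node helper has been sent that id from the module config.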