tuntundu

Read Html File From Vb.net

15 posts in this topic

Dear expert,

I am currently writing an application in VB.NET that reads HTML files and converts them into text files. Do you have any suggestions or sample code?

 

Br,

tuntundu


You have to read your .htm file character by character and copy the characters to a .txt file, but on encountering a "<" you have to ignore the text until you encounter a ">".

In this way all tags will be removed and your .htm file will be converted into a .txt file.
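The character-by-character approach described above could be sketched roughly as follows. This is only an illustration: the module name, the function name StripTags, and both file paths are made up for the example. It does not decode entities such as &amp;, and it will be confused by a ">" inside a comment, script block, or attribute value.

```vbnet
Imports System.IO
Imports System.Text

Module StripTagsSketch

    ' Copies every character except "<", ">" and whatever lies between them.
    Function StripTags(ByVal html As String) As String
        Dim result As New StringBuilder()
        Dim insideTag As Boolean = False

        For Each c As Char In html
            If c = "<"c Then
                ' A tag starts here: stop copying.
                insideTag = True
            ElseIf insideTag Then
                ' Skip tag content; resume copying after the closing ">".
                If c = ">"c Then insideTag = False
            Else
                result.Append(c)
            End If
        Next

        Return result.ToString()
    End Function

    Sub Main()
        ' Both paths are placeholders (assumptions); change them for your machine.
        Dim html As String = File.ReadAllText("C:\input.htm")
        File.WriteAllText("C:\output.txt", StripTags(html))
    End Sub

End Module
```

For the table sample discussed later in the thread, this would leave only the cell text and the whitespace between the tags, so the output would probably still need trimming.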

 

And if you need help with reading and writing files, you can refer here:

http://www.builderau.com.au/program/windows/0,39024644,20267367,00.htm

 

Have a nice time ;)


If you need to read the HTML file as tags, you can use the mshtml object model. First load the document, then walk through the tags you need. Please give us a sample file and tell us what you intend to do with it while reading the file. This will help us give you a better solution.

 

Cheers

Duser


I think I faced the same problem, but I use C#. I use the XML namespace, specifically the XmlTextReader class, to navigate through an HTML file. You can get the tags, values, and attributes and process them.

If you need further help, don't hesitate to ask.


How do you handle a poorly written HTML file that is not valid XML? I guess this is where the mshtml object has the advantage over the XmlReader classes.

Of course, if your HTML file is from a trusted source and is well-formed XML, your solution is the ideal choice.

 

Cheers

Duser


Hi Duser,

Below is a sample of the HTML file format I need to read.

 

<H3>header</H3>
<TABLE border="1" width="80%">
<tr>
<td align="right">test1</td>
<td align="left"><b>-</b></td>
</tr>
<tr>
<td align="right">test2</td>
<td align="left"><b>-</b></td>
</tr>
<tr>
<td align="right">test3</td>
<td align="left"><b>RUSSTKB</b></td>
</tr>
</TABLE>

 

Thanks a lot,

Oscar


This is the test file.html:
--------------------------

<html>
<head>
</head>
<body>
<H3>header</H3>
<TABLE border="1" width="80%">
<tr>
<td align="right">test1</td>
<td align="left"><b>-</b></td>
</tr>
</TABLE>
</body>
</html>

 

This is the code:
------------------

// Requires: using System.Xml; (listBox1 is a ListBox on the form)
listBox1.Items.Clear();
int i = 0;
XmlTextReader xtr = new XmlTextReader("c:\\test.html");
xtr.WhitespaceHandling = WhitespaceHandling.None;

while (xtr.Read())
{
    // Node index, node type, name and value
    string s = i.ToString() + "-" + xtr.NodeType + " " + xtr.Name + " " + xtr.Value + " ";

    if (xtr.HasAttributes)
    {
        for (int j = 0; j < xtr.AttributeCount; j++)
        {
            xtr.MoveToAttribute(j);
            s += " ATT:" + xtr.Name + " =" + xtr.GetAttribute(j);
        }
        // Return from the last attribute to the owning element
        xtr.MoveToElement();
    }
    listBox1.Items.Add(s);
    i++;
}
xtr.Close();

 

 

And this is what is in the listbox:
----------------------------------

0-Element html
1-Element head
2-EndElement head
3-Element body
4-Element H3
5-Text header
6-EndElement H3
7-Element TABLE ATT:border =1 ATT:width =80%
8-Element tr
9-Element td ATT:align =right
10-Text test1
11-EndElement td
12-Element td ATT:align =left
13-Element b
14-Text -
15-EndElement b
16-EndElement td
17-EndElement tr
18-EndElement TABLE
19-EndElement body
20-EndElement html

 

 

 

You can get the type (Element, Attribute, ...), the name (html, head, ...), the value (-, border, ...), and the depth of each node.


tuntundu,

Looks like your HTML doc is well-formed except that it lacks a single root element, so an XML-based solution should work for you.

Just another question: is this a forms-based application, a console app, or a class library? I ask because with mshtml the document object has to be created from an existing document, so it might be necessary to host the WebBrowser control as well. Of course, with the XML-based solution, these drawbacks don't exist.

 

Cheers

Duser


Hi Duser,

My requirement is just to read these HTML files and load the data into a database.

Maybe minafawzi's approach would work, but I have to translate it to VB code and test it.

 

BR,

tuntundu


Oops, I didn't notice that it was C# code. I will write a small script, assuming you are reading this HTML file from disk as-is.

 

Cheers

Duser

Edited by duser2k3

Man, the class names are the same in C# and VB.NET, so don't worry about the conversion.

Also, you can write a C# class and consume it from VB.NET.
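To show how directly the reader loop carries over, here is a rough VB.NET rendering of the C# snippet posted earlier in the thread. It is a sketch, not a tested port: the module name is made up, it writes to the console instead of a ListBox, and the path is the same placeholder used in the C# version.

```vbnet
Imports System
Imports System.Xml

Module ReadNodesSketch

    Sub Main()
        ' Placeholder path; the file must be well-formed XML with a single root element.
        Dim xtr As New XmlTextReader("c:\test.html")
        xtr.WhitespaceHandling = WhitespaceHandling.None

        Dim i As Integer = 0
        While xtr.Read()
            ' Node index, node type, name and value
            Dim s As String = i.ToString() & "-" & xtr.NodeType.ToString() & " " & xtr.Name & " " & xtr.Value

            If xtr.HasAttributes Then
                For j As Integer = 0 To xtr.AttributeCount - 1
                    xtr.MoveToAttribute(j)
                    s &= " ATT:" & xtr.Name & " =" & xtr.Value
                Next
                ' Return from the last attribute to the owning element.
                xtr.MoveToElement()
            End If

            Console.WriteLine(s)
            i += 1
        End While

        xtr.Close()
    End Sub

End Module
```

The class, property, and method names are identical in both languages; only the syntax around them changes (Dim instead of type-first declarations, & for string concatenation, While ... End While instead of braces).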


tuntundu,

I created a new console app. You can create the same and replace the module contents. No additional references are needed.

 

The Console Module

Module modMain

 Sub Main()
   Dim sr As System.IO.StreamReader
   'Change this path to the one on your machine
   sr = New System.IO.StreamReader("\Documents and Settings\Administrator\Desktop\sample.htm")
   Dim htmlContent As String
   htmlContent = sr.ReadToEnd
   sr.Close()

   Dim oParseHTML As ParseHTML
   oParseHTML = New ParseHTML(htmlContent)

   With oParseHTML
     Console.WriteLine("<<----XML Content follows---->>" & vbCrLf & .xmlContent)

     Console.WriteLine("<<----Found " & .MaxParams & " parameters---->>")

     Dim i As Integer
     Console.WriteLine("<<----Listing by integer index follows---->>")
     For i = 1 To .MaxParams
       Console.WriteLine("Item [" & i & "] : " & .ParamValue(i))
     Next


     'This is probably the most relevant portion of the sample output
     Console.WriteLine("<<----Listing by name follows---->>")
     Dim ParamName As String
     ParamName = "test1" : Console.WriteLine("Item [" & ParamName & "] : " & .ParamValue(ParamName))
     ParamName = "test2" : Console.WriteLine("Item [" & ParamName & "] : " & .ParamValue(ParamName))
     ParamName = "test3" : Console.WriteLine("Item [" & ParamName & "] : " & .ParamValue(ParamName))
   End With

   Console.ReadLine()
 End Sub

 Private Class ParseHTML
   Private _HTMLString As String
   Private _xmlDoc As Xml.XmlDocument

   Private _Params As Collection

   Public Sub New(ByVal HTMLString As String)
     _Params = New Collection

     _xmlDoc = New Xml.XmlDocument
     _xmlDoc.LoadXml("<root />")

     Dim RootEle As Xml.XmlElement
     RootEle = _xmlDoc.DocumentElement
     RootEle.InnerXml = HTMLString      'If you are unsure of the HTML file, wrap this in a Try/Catch to handle malformed HTML

     'Get the table node
     Dim xTableNode As Xml.XmlNode
     xTableNode = RootEle.SelectSingleNode("TABLE")

     Dim xTRNode As Xml.XmlNode
     Dim xTDNode As Xml.XmlNode
     Dim ParamName As String
     Dim ParamValue As String
     'Get each row node
     For Each xTRNode In xTableNode.SelectNodes("tr")      'If you are unsure of the HTML file, wrap this in a Try/Catch to handle a missing TABLE node
       'Get the left column
       xTDNode = xTRNode.SelectSingleNode("td[@align='right']")
       ParamName = xTDNode.InnerText        'If you are unsure of the HTML file, wrap this in a Try/Catch to handle a missing td node
       'Get the right column
       xTDNode = xTRNode.SelectSingleNode("td[@align='left']")
       ParamValue = xTDNode.InnerText        'If you are unsure of the HTML file, wrap this in a Try/Catch to handle a missing td node
       'Save the info
       _Params.Add(ParamValue, ParamName)
     Next
   End Sub

   '<summary>
   'Left for debugging purposes only and can be eliminated in the final release
   '</summary>
   Public ReadOnly Property xmlContent() As String
     Get
       Return _xmlDoc.InnerXml
     End Get
   End Property

   '<summary>
   'Access ParamValue content by Index
   '</summary>
   Public ReadOnly Property ParamValue(ByVal ParamIndex As Integer) As String
     Get
       Return CType(_Params.Item(ParamIndex), String)
     End Get
   End Property

   '<summary>
   'Access ParamValue content by Name of the parameter
   '</summary>
   Public ReadOnly Property ParamValue(ByVal ParamName As String) As String
     Get
       Return CType(_Params.Item(ParamName), String)
     End Get
   End Property

   '<summary>
   'Get the maximum number of Parameters in the html file
   '</summary>
   Public ReadOnly Property MaxParams() As Integer
     Get
       Return _Params.Count
     End Get
   End Property

 End Class
End Module

 

Sample.htm

<H3>header</H3>
<TABLE border="1" width="80%">
<tr>
<td align="right">test1</td>
<td align="left"><b>-</b></td>
</tr>
<tr>
<td align="right">test2</td>
<td align="left"><b>-</b></td>
</tr>
<tr>
<td align="right">test3</td>
<td align="left"><b>RUSSTKB</b></td>
</tr>
</TABLE>

 

I just added the sample.htm file to my desktop. You may need to set that location correctly in the code.

minafawzi's code is spot on using XmlTextReader and is a quick implementation. My code is a VB.NET implementation that is relatively slower at runtime. Trust me, you won't notice the difference unless your XML file is at least 10,000-odd lines ;)

Of course, all this would fail if your HTML file were not well-formed XML like the sample you provided.
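If the input cannot be trusted, one way to find out up front is to attempt the parse inside a Try/Catch. The following is a minimal sketch under that assumption: the module name and the helper IsXmlCompatible are hypothetical, and it wraps the fragment in a root element by string concatenation rather than through InnerXml as the class above does.

```vbnet
Imports System
Imports System.Xml

Module WellFormedCheckSketch

    ' Returns True when the fragment parses as XML once wrapped in a root element.
    Function IsXmlCompatible(ByVal htmlFragment As String) As Boolean
        Dim doc As New XmlDocument()
        Try
            doc.LoadXml("<root>" & htmlFragment & "</root>")
            Return True
        Catch ex As XmlException
            ' Typical causes: unquoted attributes, unclosed tags such as <br>.
            Console.WriteLine("Not well-formed: " & ex.Message)
            Return False
        End Try
    End Function

    Sub Main()
        ' Quoted attribute and matching tags: parses.
        Console.WriteLine(IsXmlCompatible("<td align=""right"">test1</td>"))
        ' Unquoted attribute and an unclosed <br>: rejected.
        Console.WriteLine(IsXmlCompatible("<td align=right>test1<br></td>"))
    End Sub

End Module
```

A file that fails this check needs a tolerant HTML parser (such as the mshtml route mentioned earlier in the thread) or a cleanup pass before any of the XML classes can read it.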

Cheers

Duser

Edited by duser2k3

duser2k3,

This is great; this is the way to deal with HTML files whether they are well-formed or not.

Thanks


Hi Duser,

It works for my requirement.

For minafawzi's code, I had to add an additional opening and closing root tag to the HTML file.

Anyway, thanks a lot to Duser and minafawzi for helping me solve the problem.

 

BR,

tuntundu


minafawzi, great job mate. My code doesn't really work for badly formed XML docs, only for ones that are merely missing a root element. Guess what, the comments were the most difficult part.

tuntundu, maybe you can close this post then and recommend minafawzi for the post of the month too.

 

Cheers

Duser

This topic is now closed to further replies.