Read Html File From Vb.net


  • This topic is locked
14 replies to this topic

#1 tuntundu

tuntundu

    Member

  • Members
  • 82 posts

Posted 25 May 2005 - 05:17 PM

Dear experts,
I want to write an application in VB.NET that reads HTML files and converts them into text files. Any suggestions or sample code would be appreciated.

Br,
tuntundu

#2 harsh_puri

harsh_puri

    Thê Tê®mînåTø®

  • Veterans
  • 2668 posts
  • Location:~~HARSH World~~
  • Interests:PROGRAMMING, PLAYING COMPUTER GAMES & SADIKHOV.COM

Posted 25 May 2005 - 06:15 PM

You have to read your .htm file character by character and copy the characters to a .txt file,
but on encountering "<" you have to ignore the text until you encounter a ">".
That way all the tags will be removed and your .htm file will be converted into a .txt file.
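
The approach described above could be sketched in VB.NET roughly as follows. This is only a minimal illustration: the module name TagStripper and the function name StripTags are made up for this example, and the logic deliberately ignores HTML entities, comments, script blocks, and any ">" that appears inside an attribute value.

```vbnet
Imports System
Imports System.Text

Module TagStripper
    ' Copies every character except those between "<" and ">".
    ' Assumption: every "<" starts a tag and the next ">" ends it.
    Public Function StripTags(ByVal html As String) As String
        Dim output As New StringBuilder()
        Dim insideTag As Boolean = False

        For Each c As Char In html
            If c = "<"c Then
                insideTag = True        ' start ignoring characters
            ElseIf c = ">"c Then
                insideTag = False       ' tag finished, resume copying
            ElseIf Not insideTag Then
                output.Append(c)
            End If
        Next

        Return output.ToString()
    End Function

    Sub Main()
        ' Prints: header
        Console.WriteLine(StripTags("<H3>header</H3>"))
    End Sub
End Module
```

To convert a file, the same function can be fed the result of System.IO.File.ReadAllText and its return value passed to System.IO.File.WriteAllText.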

And if you need help with reading and writing files, you can refer here:
http://www.builderau.com.au/program/windows/0,39024644,20267367,00.htm

Have a nice time ;)

#3 duser2k3

duser2k3

    ~ The One ~

  • Veterans
  • 2763 posts
  • Location:@ Sadikhov at least 70% of my time

Posted 25 May 2005 - 08:28 PM

If you need to read the HTML file as tags, you can use the MSHTML object model. First load the document, then walk the tags as required. Please give us a sample file and tell us what you intend to do with it while reading the file. That will help us give you a better solution.

Cheers
Duser

#4 minafawzi

minafawzi

    Advanced Member

  • Members
  • 374 posts

Posted 25 May 2005 - 11:51 PM

I think I faced the same kind of application, but I use C#. I use the System.Xml namespace, specifically the XmlTextReader class, to navigate through an HTML file. You can get the tags, values, and attributes and process them.

If you need further help, don't hesitate to ask.

#5 duser2k3

duser2k3

    ~ The One ~

  • Veterans
  • 2763 posts
  • Location:@ Sadikhov at least 70% of my time

Posted 26 May 2005 - 04:29 PM

minafawzi, how do you handle a poorly written HTML file that is not well-formed XML? I guess this is where the MSHTML object model has the advantage over the XmlReader classes.
Of course, if your HTML file is from a trusted source and is well-formed XML, your solution is the ideal choice.
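
To make the concern concrete, here is a small sketch of how one might check in advance whether a piece of HTML will survive an XML parser. The names WellFormedCheck and IsWellFormed are invented for this example, and wrapping the markup in a root element is an assumption that only suits fragments without a DOCTYPE or XML declaration.

```vbnet
Imports System
Imports System.Xml

Module WellFormedCheck
    ' Returns True when the markup parses as XML once wrapped in a single root element.
    Public Function IsWellFormed(ByVal html As String) As Boolean
        Try
            Dim doc As New XmlDocument()
            doc.LoadXml("<root>" & html & "</root>")
            Return True
        Catch ex As XmlException
            ' Unclosed tags, attributes without values, undefined entities and so on end up here.
            Return False
        End Try
    End Function

    Sub Main()
        Console.WriteLine(IsWellFormed("<p>closed<br /></p>"))    ' True
        Console.WriteLine(IsWellFormed("<p>unclosed<br></p>"))    ' False: <br> is never closed
        Console.WriteLine(IsWellFormed("<td nowrap>x</td>"))      ' False: attribute has no value
    End Sub
End Module
```

Typical browser-tolerated HTML such as a bare <br> fails this check, which is the case where an XML reader cannot be used directly.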

Cheers
Duser

#6 tuntundu

tuntundu

    Member

  • Members
  • 82 posts

Posted 27 May 2005 - 08:16 PM

Hi Duser,
Below is a sample of the HTML file format I need to read.

<H3>header</H3>
<TABLE border="1" width="80%">
<tr>
<td align="right">test1</td>
<td align="left"><b>-</b></td>
</tr>
<tr>
<td align="right">test2</td>
<td align="left"><b>-</b></td>
</tr>
<tr>
<td align="right">test3</td>
<td align="left"><b>RUSSTKB</b></td>
</tr>
</TABLE>

Thanks a lot,
Oscar

#7 minafawzi

minafawzi

    Advanced Member

  • Members
  • 374 posts

Posted 27 May 2005 - 10:44 PM

This is the test file, test.html:
--------------------------

<html>
<head>
</head>
<body>
<H3>header</H3>
<TABLE border="1" width="80%">
<tr>
<td align="right">test1</td>
<td align="left"><b>-</b></td>
</tr>
</TABLE>
</body>
</html>

this is the code:
------------------

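// Requires: using System.Xml;
listBox1.Items.Clear();
int i = 0;
XmlTextReader xtr = new XmlTextReader("c:\\test.html");
xtr.WhitespaceHandling = WhitespaceHandling.None;

while (xtr.Read())
{
    // Node index, type, name, and value (the value is empty for element nodes).
    string s = i.ToString() + "-" + xtr.NodeType + " " + xtr.Name + " " + xtr.Value + " ";

    if (xtr.HasAttributes)
    {
        for (int j = 0; j < xtr.AttributeCount; j++)
        {
            xtr.MoveToAttribute(j);
            s += " ATT:" + xtr.Name + " =" + xtr.Value;
        }
        xtr.MoveToElement(); // return to the element that owns the attributes
    }
    listBox1.Items.Add(s);
    i++;
}
xtr.Close();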


And this is what appears in the list box:
----------------------------------

0-Element html
1-Element head
2-EndElement head
3-Element body
4-Element H3
5-Text header
6-EndElement H3
7-Element TABLE ATT:border =1 ATT:width =80%
8-Element tr
9-Element td ATT:align =right
10-Text test1
11-EndElement td
12-Element td ATT:align =left
13-Element b
14-Text -
15-EndElement b
16-EndElement td
17-EndElement tr
18-EndElement TABLE
19-EndElement body
20-EndElement html



You can get the type (Element, Attribute, ...), name (html, head, ...), value (header, -, ...), and depth of each node.

#8 duser2k3

duser2k3

    ~ The One ~

  • Veterans
  • 2763 posts
  • Location:@ Sadikhov at least 70% of my time

Posted 28 May 2005 - 12:06 AM

tuntundu,
Looks like your HTML doc is well-formed except for the missing root element, so an XML-based solution should work for you.
Just another question: is this a forms-based application, a console app, or a class library? I ask because with MSHTML the document element needs to be created from an existing document, so it might be necessary to host the WebBrowser control as well. Of course, the XML-based solution has none of these drawbacks.

Cheers
Duser

#9 tuntundu

tuntundu

    Member

  • Members
  • 82 posts

Posted 30 May 2005 - 11:30 AM

Hi Duser,
My requirement is just to read these HTML files and load the data into a database.
Maybe minafawzi's approach will work, but I have to translate it to VB code and test it.

BR,
tuntundu

#10 duser2k3

duser2k3

    ~ The One ~

  • Veterans
  • 2763 posts
  • Location:@ Sadikhov at least 70% of my time

Posted 30 May 2005 - 01:18 PM

Oops, I didn't notice that it was C# code. I will write a small script, assuming you are reading this HTML file from disk as is.

Cheers
Duser

Edited by duser2k3, 30 May 2005 - 01:29 PM.


#11 minafawzi

minafawzi

    Advanced Member

  • Members
  • 374 posts

Posted 30 May 2005 - 01:57 PM

Man, the class names are the same in C# and VB.NET,
so don't worry about the conversion.

Also, you can write a C# class and consume it from VB.NET.
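
As a rough illustration of how direct the conversion is, the XmlTextReader loop might look like this in VB.NET. This is a sketch rather than a tested port: the names ReaderDemo and DumpNodes are invented here, the list box is replaced by a returned list so the sample runs as a console app, and the HTML is an in-memory stand-in that is assumed to be well-formed XML.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml

Module ReaderDemo
    ' Returns one line per node: index, node type, name, value, then any attributes.
    Public Function DumpNodes(ByVal html As String) As List(Of String)
        Dim lines As New List(Of String)
        Dim xtr As New XmlTextReader(New StringReader(html))
        xtr.WhitespaceHandling = WhitespaceHandling.None

        Dim i As Integer = 0
        While xtr.Read()
            Dim s As String = i.ToString() & "-" & xtr.NodeType.ToString() & " " & xtr.Name & " " & xtr.Value

            If xtr.HasAttributes Then
                For j As Integer = 0 To xtr.AttributeCount - 1
                    xtr.MoveToAttribute(j)
                    s &= " ATT:" & xtr.Name & "=" & xtr.Value
                Next
                xtr.MoveToElement()   ' go back to the element that owns the attributes
            End If

            lines.Add(s)
            i += 1
        End While
        xtr.Close()
        Return lines
    End Function

    Sub Main()
        Dim html As String = "<html><body><H3>header</H3>" & _
            "<TABLE border=""1""><tr><td align=""right"">test1</td></tr></TABLE>" & _
            "</body></html>"
        For Each line As String In DumpNodes(html)
            Console.WriteLine(line)
        Next
    End Sub
End Module
```

The class, property, and method names are identical to the C# version; only the syntax around them changes.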

#12 duser2k3

duser2k3

    ~ The One ~

  • Veterans
  • 2763 posts
  • Location:@ Sadikhov at least 70% of my time

Posted 30 May 2005 - 02:05 PM

tuntundu,
I created a new console app. You can create the same and replace the module contents. No additional references are needed.

The Console Module
Module modMain

  Sub Main()
    Dim sr As System.IO.StreamReader
    'Change this path to the one on your machine
    sr = New System.IO.StreamReader("\Documents and Settings\Administrator\Desktop\sample.htm")
    Dim htmlContent As String
    htmlContent = sr.ReadToEnd
    sr.Close()

    Dim oParseHTML As ParseHTML
    oParseHTML = New ParseHTML(htmlContent)

    With oParseHTML
      Console.WriteLine("<<----XML Content follows---->>" & vbCrLf & .xmlContent)

      Console.WriteLine("<<----Found " & .MaxParams & " parameters---->>")

      Dim i As Integer
      Console.WriteLine("<<----Listing by integer index follows---->>")
      For i = 1 To .MaxParams
        Console.WriteLine("Item [" & i & "] : " & .ParamValue(i))
      Next


      'This is probably the most relevant portion of the sample output
      Console.WriteLine("<<----Listing by name follows---->>")
      Dim ParamName As String
      ParamName = "test1" : Console.WriteLine("Item [" & ParamName & "] : " & .ParamValue(ParamName))
      ParamName = "test2" : Console.WriteLine("Item [" & ParamName & "] : " & .ParamValue(ParamName))
      ParamName = "test3" : Console.WriteLine("Item [" & ParamName & "] : " & .ParamValue(ParamName))
    End With

    Console.ReadLine()
  End Sub

  Private Class ParseHTML
    Private _HTMLString As String
    Private _xmlDoc As Xml.XmlDocument

    Private _Params As Collection

    Public Sub New(ByVal HTMLString As String)
      _Params = New Collection

      _xmlDoc = New Xml.XmlDocument
      _xmlDoc.LoadXml("<root />")

      Dim RootEle As Xml.XmlElement
      RootEle = _xmlDoc.DocumentElement
      RootEle.InnerXml = HTMLString      'If you are unsure of the HTML file, wrap this in a Try...Catch to handle malformed markup

      'Get the table node
      Dim xTableNode As Xml.XmlNode
      xTableNode = RootEle.SelectSingleNode("TABLE")

      Dim xTRNode As Xml.XmlNode
      Dim xTDNode As Xml.XmlNode
      Dim ParamName As String
      Dim ParamValue As String
      'Get each row node
      For Each xTRNode In xTableNode.SelectNodes("tr")      'If you are unsure of the HTML file, wrap this in a Try...Catch to handle a missing TABLE node
        'Get the left column
        xTDNode = xTRNode.SelectSingleNode("td[@align='right']")
        ParamName = xTDNode.InnerText        'If you are unsure of the HTML file, wrap this in a Try...Catch to handle a missing td node
        'Get the right column
        xTDNode = xTRNode.SelectSingleNode("td[@align='left']")
        ParamValue = xTDNode.InnerText        'If you are unsure of the HTML file, wrap this in a Try...Catch to handle a missing td node
        'Save the info
        _Params.Add(ParamValue, ParamName)
      Next
    End Sub

    '<summary>
    'Left for debugging purposes only and can be eliminated in the final release
    '</summary>
    Public ReadOnly Property xmlContent() As String
      Get
        Return _xmlDoc.InnerXml
      End Get
    End Property

    '<summary>
    'Access ParamValue content by Index
    '</summary>
    Public ReadOnly Property ParamValue(ByVal ParamIndex As Integer) As String
      Get
        Return CType(_Params.Item(ParamIndex), String)
      End Get
    End Property

    '<summary>
    'Access ParamValue content by Name of the parameter
    '</summary>
    Public ReadOnly Property ParamValue(ByVal ParamName As String) As String
      Get
        Return CType(_Params.Item(ParamName), String)
      End Get
    End Property

    '<summary>
    'Get the maximum number of Parameters in the html file
    '</summary>
    Public ReadOnly Property MaxParams() As Integer
      Get
        Return _Params.Count
      End Get
    End Property

  End Class
End Module

Sample.htm
<H3>header</H3>
<TABLE border="1" width="80%">
<tr>
<td align="right">test1</td>
<td align="left"><b>-</b></td>
</tr>
<tr>
<td align="right">test2</td>
<td align="left"><b>-</b></td>
</tr>
<tr>
<td align="right">test3</td>
<td align="left"><b>RUSSTKB</b></td>
</tr>
</TABLE>

I just added the sample.htm file to my desktop. You may need to set that location correctly in the code.

minafawzi's code using XmlTextReader is spot on and is a quick implementation. My code is a VB.NET implementation that is somewhat slower at runtime, but trust me, you won't notice the difference unless your XML file is at least 10,000-odd lines ;)

Of course, all this would fail if your HTML file were not well-formed XML like the sample you provided.
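
For files you are unsure about, the Try...Catch hints in the code comments could be combined into something like the sketch below. This is an assumption-laden variant, not a drop-in replacement for the class above: the names SafeParse and ExtractPairs are invented, a Dictionary stands in for the Collection, missing nodes are handled with Nothing checks instead of exceptions, and a repeated parameter name simply overwrites the earlier value.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Xml

Module SafeParse
    ' Returns name/value pairs from the first TABLE, or Nothing when the markup is not well-formed XML.
    Public Function ExtractPairs(ByVal html As String) As Dictionary(Of String, String)
        Dim doc As New XmlDocument()
        doc.LoadXml("<root />")
        Try
            doc.DocumentElement.InnerXml = html
        Catch ex As XmlException
            Return Nothing                      ' malformed markup
        End Try

        Dim pairs As New Dictionary(Of String, String)
        Dim table As XmlNode = doc.DocumentElement.SelectSingleNode("TABLE")
        If table Is Nothing Then Return pairs   ' no TABLE element: nothing to read

        For Each row As XmlNode In table.SelectNodes("tr")
            Dim nameCell As XmlNode = row.SelectSingleNode("td[@align='right']")
            Dim valueCell As XmlNode = row.SelectSingleNode("td[@align='left']")
            ' Skip rows that lack either cell instead of throwing.
            If nameCell IsNot Nothing AndAlso valueCell IsNot Nothing Then
                pairs(nameCell.InnerText) = valueCell.InnerText
            End If
        Next
        Return pairs
    End Function

    Sub Main()
        Dim sample As String = "<H3>header</H3><TABLE border=""1"">" & _
            "<tr><td align=""right"">test3</td><td align=""left""><b>RUSSTKB</b></td></tr></TABLE>"
        Dim pairs As Dictionary(Of String, String) = ExtractPairs(sample)
        Console.WriteLine(pairs("test3"))       ' prints RUSSTKB
    End Sub
End Module
```

A caller can then tell the three cases apart: Nothing means the file could not be parsed, an empty dictionary means no usable rows were found, and anything else is the data.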
Cheers
Duser

Edited by duser2k3, 30 May 2005 - 02:06 PM.


#13 minafawzi

minafawzi

    Advanced Member

  • Members
  • 374 posts

Posted 30 May 2005 - 03:03 PM

duser2k3,
this is great. This is the way to deal with HTML files, whether they are well formed or not.

Thanks

#14 tuntundu

tuntundu

    Member

  • Members
  • 82 posts

Posted 30 May 2005 - 03:50 PM

Hi Duser,
It works for my requirement.
For minafawzi's code I had to add a start and end root tag to the HTML file.

Anyway, thanks a lot to Duser and minafawzi for helping me solve the problem.

BR,
tuntundu

#15 duser2k3

duser2k3

    ~ The One ~

  • Veterans
  • 2763 posts
  • Location:@ Sadikhov at least 70% of my time

Posted 30 May 2005 - 04:02 PM

minafawzi, great job mate. My code doesn't really work for badly formed XML docs, only for ones that are just missing a root element. Guess what, the comments were the most difficult part.

tuntundu, maybe you can close this post then, and recommend minafawzi for the post of the month too.

Cheers
Duser