Text extraction from DOC and PDF

 

Text extraction from DOC and PDF, Trying to retrieve text from doc and pdf files... help

Human.animal
13 Oct, 2008 - 03:08 AM
Post #1

New D.I.C Head

Joined: 13 Oct, 2008
Posts: 5


Dear friends,
I am new to this site and just registered. I looked for an answer before posting my question, as I like to learn from my mistakes, but I prefer to learn from others'...
As I didn't find one, here it goes:
I am trying to make a fairly simple program that will open a file, read its contents, extract up to two email addresses, and close the file. It works so-so with Word docs, but I can't seem to access PDF files.
I have tried PDFBox and Report.dll, but can't find an effective way to just read the text out of a PDF...
I have read your policy, and here's what I have so far:
CODE

Imports System.IO
Imports System.IO.File
Imports System.IO.DirectoryInfo

Public Class FrmSacaCVs
    Private WordApp As New Word.Application()
    Dim FicheiroActivo As Object
    Dim ConsegueAbrirFicheiro, ConsegueAbrirPasta As Boolean
    Dim Dados As IDataObject
    Dim TextoCompleto As String
    Dim NullObj As Object = System.Reflection.Missing.Value
    Dim Doc As Word.Document
    Dim Caracter, Comprimento, InicioMail, FimMail, Estatuto, LetrasNome, LetrasServidor, LetrasCom As Integer
    Dim MMail, SMail As String
    Dim caracteravaliado As Char
    Dim FicheirosProcessados As Integer = 0
    Dim FicheirosManuais As Integer = 0
    Dim di As DirectoryInfo
    Dim FicheirosnaPasta As FileInfo()


    Dim TotaldeFicheiros, RodaFicheiros, Processados, Manuais As Integer
    Dim ListaProcessados(20000, 2) As String
    Dim ListaManuais(10000) As String
    Dim FicheiroCorrente, PastaCorrente As String

    Private Sub ProcessaFicheiroToolStripMenuItem_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles ProcessaFicheiroToolStripMenuItem.Click
        Try
            With OpenFileDialog1
                .InitialDirectory = "C:\"
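                'The PDF filter pattern must be *.pdf (it read *.prf), or PDF files never show up in the dialog
                .Filter = "Docs Word (*.doc;*.rtf;*.txt)|*.doc;*.rtf;*.txt|PDFs (*.pdf)|*.pdf"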
                .ShowDialog()
            End With
            FicheiroActivo = OpenFileDialog1.FileName
            Dim Fich As String
            Fich = FicheiroActivo.ToString
            Fich = Fich.ToLower
            LblFich2.Text = Fich
            If Fich.Substring(Fich.Length - 3, 3) = "doc" Or Fich.Substring(Fich.Length - 3, 3) = "rtf" Or Fich.Substring(Fich.Length - 3, 3) = "txt" Then
                ' MsgBox(Fich)
                AbreDoc(FicheiroActivo)
            Else
                If Fich.Substring(Fich.Length - 3, 3) = "pdf" Then
                    AbrePDF(FicheiroActivo)
                Else
                    TextoCompleto = ""
                    MsgBox("Não é documento compatível")
                End If
            End If
            RTB1.Text = ""
            FicheiroActivo = ""
        Catch ex As Exception
            'Report the failure instead of silently swallowing it
            MsgBox(ex.Message)
        End Try
        
    End Sub
    Private Sub ProcessaPastaToolStripMenuItem_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles ProcessaPastaToolStripMenuItem.Click
        'process an entire folder
        Try
            With OpenFolderDialog1
                'browse for the folder
                'FicheirosnaPasta = 4
                .ShowNewFolderButton = False
                .RootFolder = Environment.SpecialFolder.MyComputer
                .ShowDialog()
                PastaCorrente = OpenFolderDialog1.SelectedPath
                LblPasta2.Text = PastaCorrente
                di = New DirectoryInfo(PastaCorrente)
                FicheirosnaPasta = di.GetFiles()
                Dim fiTemp As FileInfo
                For Each fiTemp In FicheirosnaPasta
                    'loop over the files
                    FicheiroActivo = PastaCorrente & "\" & fiTemp.Name
                    Dim Fich As String
                    Fich = FicheiroActivo.ToString
                    Fich = Fich.ToLower
                    LblFich2.Text = Fich
                    If Fich.Substring(Fich.Length - 3, 3) = "doc" Or Fich.Substring(Fich.Length - 3, 3) = "rtf" Or Fich.Substring(Fich.Length - 3, 3) = "txt" Then
                        'MsgBox(Fich)
                        AbreDoc(FicheiroActivo)
                    Else
                        If Fich.Substring(Fich.Length - 3, 3) = "pdf" Then
                            AbrePDF(FicheiroActivo.ToString)

                        Else
                            TextoCompleto = ""
                            MsgBox("Não é documento compatível")
                        End If
                    End If
                    RTB1.Text = ""
                    FicheiroActivo = ""



                    'Still need to add a timer to throttle access


                Next

            End With
        Catch ex As Exception
            MsgBox("Não foi seleccionada pasta válida")
        End Try
    End Sub

    Private Function AbreDoc(ByVal Ficheiro As Object) As Boolean
        Dim WordApp As New Word.Application()
        ConsegueAbrirFicheiro = False
        Try
            Doc = WordApp.Documents.Open(Ficheiro, NullObj, NullObj, NullObj, NullObj, NullObj, NullObj, NullObj, NullObj, NullObj, NullObj, NullObj, NullObj, NullObj, NullObj)
            SacaDadosDoc()
            Doc.Close(NullObj, NullObj, NullObj)
            WordApp.Quit(NullObj, NullObj, NullObj)
            ConsegueAbrirFicheiro = True
        Catch ex As Exception
            ConsegueAbrirFicheiro = False
            Doc = Nothing 'Doc is a Word.Document, so it cannot be assigned an empty string
            TextoCompleto = ""
            MsgBox("Não foi aberto qualquer ficheiro")
        End Try
        'The function is declared As Boolean, so return the success flag
        Return ConsegueAbrirFicheiro
    End Function
    Private Function AbrePDF(ByVal Ficheiro As Object) As Boolean
        Try
            'Dim PDDocument As PDDocument = PDDocument.load(Ficheiro)
            'Dim textstripper As New pdftextstripper
            'TextoCompleto = textstripper.gettext(PDDocument)
            RTB1.Text = TextoCompleto
            MMail = ""
            SMail = ""
            SacaEmail()
            If MMail = "" Then
                ListaManuais(Manuais) = FicheiroActivo
                GrdManuais.Rows.Insert(Manuais, FicheiroActivo)
                Manuais = Manuais + 1
            Else
                ListaProcessados(Processados, 0) = FicheiroActivo
                ListaProcessados(Processados, 1) = MMail
                ListaProcessados(Processados, 2) = SMail
                GrdProcessados.Rows.Insert(Processados, FicheiroActivo, MMail, SMail)
                Processados = Processados + 1
            End If

        Catch ex As Exception
            MsgBox("Não é possível sacar do PDF")

        End Try
        '###ORIGINAL
        'Private Function TransformPdfToText(ByVal SourceFile As String) As String

        '   Dim PDDocument As PDDocument = PDDocument.load(SourceFile)

        '  Dim TextStripper As New PDFTextStripper

        ' Return TextStripper.getText(PDDocument)





    End Function
    Private Sub SacaDadosDoc()
        Try
            Doc.ActiveWindow.Selection.WholeStory()
            Doc.ActiveWindow.Selection.Copy()
            Dados = Clipboard.GetDataObject()


            'Do whatever with the text.
            TextoCompleto = Dados.GetData(DataFormats.Text).ToString()

            'Close doc and shutdown Word application.
            RTB1.Text = TextoCompleto
            MMail = ""
            SMail = ""
            SacaEmail()
            If MMail = "" Then
                ListaManuais(Manuais) = FicheiroActivo
                GrdManuais.Rows.Insert(Manuais, FicheiroActivo)
                Manuais = Manuais + 1
            Else
                ListaProcessados(Processados, 0) = FicheiroActivo
                ListaProcessados(Processados, 1) = MMail
                ListaProcessados(Processados, 2) = SMail
                GrdProcessados.Rows.Insert(Processados, FicheiroActivo, MMail, SMail)
                Processados = Processados + 1
            End If

        Catch ex As Exception
            MsgBox("Não consigo sacar dados")
        End Try
    End Sub
    Private Sub SacaEmail()
        renova()
        MMail = ""
        SMail = ""
        LblMMail2.Text = ""
        LblSMail2.Text = ""
        TextoCompleto = RTB1.Text
        Comprimento = TextoCompleto.Length
        For Caracter = 0 To Comprimento - 1
            caracteravaliado = TextoCompleto(Caracter)

            If Estatuto = 7 Then 'zeboga@hotmail.com
                If VLetra(caracteravaliado) = "ponto" Then
                    'if it is a dot
                    If VLetra(TextoCompleto(Caracter + 1)) = "letra" Or VLetra(TextoCompleto(Caracter + 1)) = "caracter" Then
                        If VLetra(TextoCompleto(Caracter + 2)) = "letra" Or VLetra(TextoCompleto(Caracter + 2)) = "caracter" Then
                            FimMail = Caracter + 3
                            gravamail()
                            renova()
                        Else
                            renova()
                        End If
                    End If
                End If
            End If

            If Estatuto = 6 Then 'zeboga@hotmail.co
                If VLetra(caracteravaliado) = "letra" Or VLetra(caracteravaliado) = "caracter" Then
                    If VLetra(TextoCompleto(Caracter + 1)) <> "ponto" Then
                        If VLetra(caracteravaliado) = "letra" Or VLetra(caracteravaliado) = "caracter" Then
                            'if it is a letter
                            LetrasCom = LetrasCom + 1
                        Else
                            FimMail = Caracter + 1
                            gravamail()
                            renova()
                        End If
                    Else
                        Estatuto = 7
                    End If
                Else
                    If VLetra(caracteravaliado) = "ponto" Then 'zeboga@hotmail.co.
                        If VLetra(TextoCompleto(Caracter + 1)) = "letra" Or VLetra(TextoCompleto(Caracter + 1)) = "caracter" Then
                            If VLetra(TextoCompleto(Caracter + 2)) = "letra" Or VLetra(TextoCompleto(Caracter + 2)) = "caracter" Then
                                FimMail = Caracter + 3
                                gravamail()
                                renova()
                            Else
                                renova()
                            End If
                        End If
                    Else
                        FimMail = Caracter
                        gravamail()
                        renova()
                    End If
                End If
            End If

            If Estatuto = 5 Then 'zeboga@hotmail.c
                If VLetra(caracteravaliado) = "letra" Or VLetra(caracteravaliado) = "caracter" Then
                    LetrasCom = 2
                    Estatuto = 6
                Else
                    renova()
                End If
            End If

            If Estatuto = 4 Then 'zeboga@hotmail.
                If VLetra(caracteravaliado) = "letra" Or VLetra(caracteravaliado) = "caracter" Then
                    LetrasCom = 1
                    Estatuto = 5
                Else
                    renova()
                End If
            End If

            If Estatuto = 3 Then 'zeboga@h
                If VLetra(caracteravaliado) = "letra" Or VLetra(caracteravaliado) = "caracter" Then
                    LetrasServidor = LetrasServidor + 1
                Else
                    If VLetra(caracteravaliado) = "ponto" Then
                        If LetrasServidor >= 2 Then
                            Estatuto = 4
                        Else
                            renova()
                        End If
                    End If
                End If
            End If

            If Estatuto = 2 Then 'zeboga@
                If VLetra(caracteravaliado) = "letra" Or VLetra(caracteravaliado) = "caracter" Then
                    LetrasServidor = 1
                    Estatuto = 3
                Else
                    renova()
                End If
            End If

            If Estatuto = 1 Then 'zeboga
                If VLetra(caracteravaliado) = "letra" Or VLetra(caracteravaliado) = "caracter" Or VLetra(caracteravaliado) = "ponto" Then
                    LetrasNome = LetrasNome + 1
                Else
                    If VLetra(caracteravaliado) = "hat" Then
                        If LetrasNome >= 2 Then
                            Estatuto = 3
                        Else
                            renova()
                        End If
                    Else
                        renova()
                    End If
                End If
            End If

            If Estatuto = 0 Then 'z
                If VLetra(caracteravaliado) = "letra" Or VLetra(caracteravaliado) = "caracter" Then
                    InicioMail = Caracter
                    LetrasNome = 1
                    Estatuto = 1
                End If
            End If
        Next Caracter




    End Sub
    Private Function VLetra(ByVal letra As Char) As String
        If (letra >= "a" And letra <= "z") Or (letra >= "A" And letra <= "Z") Or (letra >= "0") And (letra <= "9") Then
            Return "letra"
        Else
            If letra = "." Then
                Return "ponto"
            Else
                If letra = "@" Then
                    Return "hat"
                Else
                    If (letra = "-") Or (letra = "_") Then
                        Return "caracter"
                    Else
                        Return ""
                    End If
                End If
            End If
        End If
    End Function
    Private Sub renova()
        InicioMail = 0
        Estatuto = 0
        LetrasNome = 0
        LetrasServidor = 0
        LetrasCom = 0
        FimMail = 0
    End Sub
    Private Sub gravamail()
        If MMail = "" Then
            MMail = TextoCompleto.ToString.Substring(InicioMail, FimMail - InicioMail)
            With RTB1
                .Select(InicioMail, FimMail - InicioMail)
                .SelectionBackColor = Color.LightGreen
                .SelectionColor = Color.Red
            End With
            LblMMail2.Text = MMail
        Else
            SMail = TextoCompleto.ToString.Substring(InicioMail, FimMail - InicioMail)
            With RTB1
                .Select(InicioMail, FimMail - InicioMail)
                .SelectionBackColor = Color.LightGreen
                .SelectionColor = Color.Red
            End With
            LblSMail2.Text = SMail
        End If
    End Sub
    Private Sub SairToolStripMenuItem_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles SairToolStripMenuItem.Click
        'close the application

        Me.Close()

    End Sub
End Class





I hope someone out there can begin to understand my code (I know it's not very well organized) and suggest a solution.
Thanks for your time
Human.animal
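
A note on the Word side of the code above: SacaDadosDoc selects the whole document and copies it through the clipboard, which overwrites whatever the user had copied and needs a visible selection. A possible alternative is to read the text straight from the document's Range. This is only a sketch: ReadWordText is a hypothetical helper name, and it assumes the project references the Microsoft.Office.Interop.Word assembly (the original code already uses Word.Application).

```vbnet
Imports System
Imports Word = Microsoft.Office.Interop.Word

Module WordTextReader
    ' Hypothetical helper: returns the document text, or Nothing if Word cannot open the file.
    Public Function ReadWordText(ByVal path As String) As String
        Dim app As New Word.Application()
        Dim doc As Word.Document = Nothing
        Try
            ' Documents.Open takes its arguments ByRef As Object, so box the path first.
            Dim fileName As Object = path
            doc = app.Documents.Open(fileName)
            ' Content is a Range covering the main document story; Text is its plain text.
            Return doc.Content.Text
        Catch ex As Exception
            Return Nothing
        Finally
            ' Close without saving and quit, so no WINWORD.EXE process is left behind.
            If doc IsNot Nothing Then doc.Close(SaveChanges:=False)
            app.Quit()
        End Try
    End Function
End Module
```

With something like this, TextoCompleto could be set from ReadWordText(FicheiroActivo.ToString) without touching the clipboard. Starting one Word instance per file is slow for thousands of documents, so reusing a single Word.Application across the folder loop may be worth trying.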

Human.animal
RE: Text Extraction From DOC And PDF
15 Oct, 2008 - 05:06 AM
Post #2

New D.I.C Head

Joined: 13 Oct, 2008
Posts: 5


Hello all,

I am trying to read the text from a PDF file, but can't seem to manage it.
I have posted before (http://www.dreamincode.net/forums/showtopic67352.htm), but maybe the post was too long, so there were no replies. I know there is a solution somewhere. Can anyone help?

Thanks
Human.animal

Nykc
RE: Text Extraction From DOC And PDF
15 Oct, 2008 - 05:12 AM
Post #3

sudo rm -R /

Joined: 14 Sep, 2007
Posts: 4,132



Thanked: 16 times
Dream Kudos: 275
Do you have any code showing how you are trying to read the file? It might be helpful.

PsychoCoder
RE: Text Extraction From DOC And PDF
15 Oct, 2008 - 05:31 AM
Post #4

using DIC.Core;

Joined: 26 Jul, 2007
Posts: 8,983



Thanked: 125 times
Dream Kudos: 8625
Expert In: VB, VB.Net, C#, SQL, ASP, ASP.Net, Web Development, HTML, CSS, Win32 API, Javascript, mySQL, J#, Boo.Net

Please don't create duplicate topics, it isn't going to get you a solution any faster, and in most cases will slow down the process of getting a solution. Topics merged

Human.animal
RE: Text Extraction From DOC And PDF
15 Oct, 2008 - 05:40 AM
Post #5

New D.I.C Head

Joined: 13 Oct, 2008
Posts: 5


QUOTE(PsychoCoder @ 15 Oct, 2008 - 06:31 AM) *

Please don't create duplicate topics, it isn't going to get you a solution any faster, and in most cases will slow down the process of getting a solution. Topics merged

Thanks, PsychoCoder, I'm glad you looked into it. Do you know of any way to do this?
Won't duplicate again, sorry.
Regards, H.A.

magicmonkey
RE: Text Extraction From DOC And PDF
15 Oct, 2008 - 07:01 AM
Post #6

D.I.C Regular

Joined: 12 Sep, 2008
Posts: 413



Thanked: 68 times
I advise that you look into SQL Server Full-Text Search. You need the Adobe IFilter to get support for PDFs; it used to be a standalone plugin, and I believe it is now included with Acrobat Reader. Why reinvent the wheel on this one?

Human.animal
RE: Text Extraction From DOC And PDF
15 Oct, 2008 - 07:18 AM
Post #7

New D.I.C Head

Joined: 13 Oct, 2008
Posts: 5


QUOTE(magicmonkey @ 15 Oct, 2008 - 08:01 AM) *

I advise that you look into SQL Server Full-Text Search. You need the Adobe IFilter to get support for PDFs; it used to be a standalone plugin, and I believe it is now included with Acrobat Reader. Why reinvent the wheel on this one?

But I'm not using SQL at all.
As stated in my first message, I am retrieving e-mail addresses from documents. This is working for DOC files, and I want to add support for PDF files as well. Is there no way to do a simple file open, read, and close of PDF files?
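
For reference, the commented-out lines in AbrePDF already point at one way to do a plain open, read, and close. The sketch below fills them in under some loud assumptions: it targets the PDFBox 0.7.x .NET build (compiled with IKVM), with the PDFBox and IKVM DLLs referenced in the project, and it uses that build's org.pdfbox namespaces (later Apache releases use org.apache.pdfbox instead). ReadPdfText is a hypothetical helper name, and none of this has been run against the poster's files.

```vbnet
Imports System
' Assumed namespaces of the PDFBox 0.7.x .NET build; adjust for other versions.
Imports org.pdfbox.pdmodel
Imports org.pdfbox.util

Module PdfTextReader
    ' Hypothetical helper: opens the PDF, extracts its text, and always closes the file.
    Public Function ReadPdfText(ByVal path As String) As String
        Dim doc As PDDocument = Nothing
        Try
            doc = PDDocument.load(path)
            Dim stripper As New PDFTextStripper()
            Return stripper.getText(doc)
        Finally
            ' Release the file handle even if extraction fails.
            If doc IsNot Nothing Then doc.close()
        End Try
    End Function
End Module
```

In AbrePDF this would amount to TextoCompleto = ReadPdfText(Ficheiro.ToString) before RTB1.Text is set. Scanned resumés that contain only page images have no text layer, so they would come back empty and end up in the manual list.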


magicmonkey
RE: Text Extraction From DOC And PDF
15 Oct, 2008 - 07:24 AM
Post #8

D.I.C Regular

Joined: 12 Sep, 2008
Posts: 413



Thanked: 68 times
Hmmm... what are you going to do with these email addresses, SPAM people?

Human.animal
RE: Text Extraction From DOC And PDF
15 Oct, 2008 - 08:18 AM
Post #9

New D.I.C Head

Joined: 13 Oct, 2008
Posts: 5


QUOTE(magicmonkey @ 15 Oct, 2008 - 08:24 AM) *

Hmmm... what are you going to do with these email addresses, SPAM people?



There would be easier and faster ways to get email addresses for spam. I am not interested in that.
What I am doing is retrieving email addresses from candidates' resumés that I have received, in DOC and PDF files, so I can contact them directly. I have around 20,000 resumés and figure it's faster to program this than to open them one by one and copy/paste the email.
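
Once the text of a resumé is in a string, the character-by-character state machine in SacaEmail could probably be replaced by a regular expression. SacaEmail also indexes TextoCompleto(Caracter + 1) and TextoCompleto(Caracter + 2), which can run past the end of the text when a candidate address sits at the very end. A minimal sketch follows, assuming a deliberately simple pattern (not a full RFC 5322 matcher); ExtractEmails and EmailPattern are hypothetical names, not part of the posted code.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module EmailExtractor
    ' Simplified pattern: name@host.tld with letters, digits, dot, underscore, and hyphen.
    Private ReadOnly EmailPattern As New Regex("[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")

    ' Returns up to maxCount distinct addresses, in order of appearance.
    Public Function ExtractEmails(ByVal text As String, ByVal maxCount As Integer) As List(Of String)
        Dim found As New List(Of String)
        For Each m As Match In EmailPattern.Matches(text)
            If found.Count >= maxCount Then Exit For
            If Not found.Contains(m.Value) Then found.Add(m.Value)
        Next
        Return found
    End Function

    Sub Main()
        Dim sample As String = "Contact: zeboga@hotmail.com or ze.boga@example.co.uk. Thanks!"
        For Each address As String In ExtractEmails(sample, 2)
            Console.WriteLine(address)
        Next
        ' Prints:
        ' zeboga@hotmail.com
        ' ze.boga@example.co.uk
    End Sub
End Module
```

The first result would map to MMail and the second, if present, to SMail. The trailing full stop after "co.uk" is left out because the pattern requires at least one character after each dot.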


