[SOLVED] PDF text extraction. Need Help!

sejal-kacharia · March 7, 2023, 9:11pm

I am trying to build a PDF text extraction feature that extracts all the text from a PDF once the user uploads it.
I have read tons of resources on how to do this in JavaScript and tried most of them, e.g. pdf-parse, express.js, etc. However, I have not been able to get it working.

Below are some of the approaches I have tried:

Extracting Text From A PDF Using Only Javascript — Hublog
Using express.js and JavaScript. See the video here https://www.youtube.com/watch?v=enfZAaTRTKU&t=216s
Using the 'pdf-parse' and 'fs-extra' npm modules.

Below is my code for the third approach.

HTML iframe code: creates a file selector and then calls the reader to read the file as an ArrayBuffer. I get the ArrayBuffer correctly.
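The iframe code itself is not shown in the post. As a minimal sketch of one way the HTML component could hand the file over, the bytes can be encoded as a base64 string before being posted, since a plain string travels safely through any JSON-based transport. The helper name `arrayBufferToBase64` and the commented-out wiring are hypothetical and are not taken from the original code.

```javascript
// Hypothetical helper (not from the original post): encode an ArrayBuffer
// as a base64 string so it can be sent as plain text.
function arrayBufferToBase64(arrayBuffer) {
  const bytes = new Uint8Array(arrayBuffer);
  const chunkSize = 0x8000; // convert in chunks to avoid huge argument lists
  let binary = "";
  for (let i = 0; i < bytes.length; i += chunkSize) {
    binary += String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize));
  }
  // btoa exists in browsers; the Buffer fallback only matters when the
  // helper is exercised outside a browser.
  return typeof btoa === "function"
    ? btoa(binary)
    : Buffer.from(binary, "binary").toString("base64");
}

// Assumed wiring inside the HTML component (browser only, illustrative):
// const reader = new FileReader();
// reader.onload = () => window.parent.postMessage({
//   fileName: file.name,
//   content: arrayBufferToBase64(reader.result),
// }, "*");
// reader.readAsArrayBuffer(file);
```

With this approach `msg.data.content` on the page would be a string rather than an ArrayBuffer, so the receiving side has to decode it again.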

Client-side code that runs when a message is received after file selection
$w.onReady(async () => {
    $w('#html1').onMessage((msg) => {
        console.log("File details " + msg.data);
        console.log("File name " + msg.data.fileName);
        console.log("File type " + msg.data.fileType);
        console.log("File size " + String(msg.data.fileSize));
        console.log("File Content " + msg.data.content); // Array buffer from the HTML component

        const dataBuffer = msg.data.content;
        console.log("Data buffer " + dataBuffer);

        // Pass the array buffer to the backend file
        let pdfText = parseTextFromPDFBuffer(dataBuffer)
            .then((data) => {
                if (data !== undefined) {
                    // Text was returned; show it on the page
                    console.log("PDF Text extracted " + data.text);
                    $w("#pdftext").text = data.text;
                } else {
                    console.log("PDF Text extracted EMPTY ");
                }
            })
            .catch((err) => {
                console.log(err);
            });
    });
});

Backend Code
export async function parseTextFromPDFBuffer(dataBuffer) {
    // This prints [object Object] instead of [object ArrayBuffer]. I am curious why this
    // happens simply from passing the buffer as a parameter to the backend.
    console.log("Array Buffer " + dataBuffer);

    console.log("Inside backend parse code");
    const pdf = require('pdf-parse');
    const fs = require('fs-extra');
    const https = require('https');

    const options = {
        // internal page parser callback
        // you can set this option if you need a format other than raw text
        pagerender: render_page,
        // max page number to parse
        max: 0,
        // check Getting Started
        version: 'v1.10.100'
    };

    console.log(" Option " + options);
    // let buff = toUint8Buffer(dataBuffer);
    // console.log("Buffer" + buffer);
    // let buff = new Uint8Array([dataBuffer])
    // Use byteLength to check the size
    console.log("Array Buffer " + dataBuffer);

    // I am not sure if I am converting this correctly from ArrayBuffer to Buffer.
    // Can someone confirm if this is the right way?
    const buf = Buffer.from(dataBuffer);
    console.log(" Buffer " + buf);

    // I think the pdf method accepts only a Buffer, not an ArrayBuffer, so I tried converting
    // from ArrayBuffer to Buffer, but it returns undefined. I think this might be related to
    // the issue above, where the ArrayBuffer parameter is getting converted to an object?

    // I have tried passing both the array buffer and the converted buffer here, but neither works.
    return pdf(dataBuffer, options)
        .then(function (data) {
            // use new format
            console.log("Got the data " + data);
            // number of pages
            console.log("No of Pages" + data.numpages);
            // number of rendered pages
            console.log("Data num renderer" + data.numrender);
            // PDF info
            console.log("Data info" + data.info);
            // PDF metadata
            console.log("Data metadata" + data.metadata);
            // PDF.js version
            // check Getting Started
            console.log("Data version" + data.version);
            // PDF text
            console.log("Data text" + data.text);
            return data;
        })
        .catch(function (error) {
            // handle exceptions
            console.log("PDF parse throws error " + error);
            return error;
        });
}

This function returns an error that says to contact the site administrator. Can anyone help me with this issue?
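One possible explanation for the `[object Object]` print, offered as an assumption rather than something confirmed in this thread: if the arguments of a backend call are serialized as JSON on the way over, an ArrayBuffer has no enumerable properties and arrives as an empty object. The sketch below simulates that with a hypothetical `jsonRoundTrip` helper and shows that a base64 string survives the same trip.

```javascript
// Simulates a JSON-based transport between page code and backend code.
// That the platform works this way is an assumption, not a confirmed fact.
function jsonRoundTrip(value) {
  return JSON.parse(JSON.stringify(value));
}

const original = new Uint8Array([37, 80, 68, 70, 45]).buffer; // bytes of "%PDF-"

const received = jsonRoundTrip(original);
console.log(String(original));             // [object ArrayBuffer]
console.log(String(received));             // [object Object]
console.log(Object.keys(received).length); // 0, the bytes are gone

// A base64 string is plain JSON, so it arrives unchanged and can be decoded.
const encoded = Buffer.from(original).toString("base64");
const decoded = Buffer.from(jsonRoundTrip(encoded), "base64");
console.log(decoded.toString("latin1"));   // %PDF-
```

If this is what is happening, no conversion on the backend can recover the file, because the bytes are already lost before the function body runs.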

Here is the sample code for extracting text from the pdf-parse example. The only thing I am doing differently is using the file selector to choose the file and reader.readAsArrayBuffer to generate the buffer. Once the buffer is generated, I call the same pdf() method and pass it my ArrayBuffer.
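The pdf-parse README example passes a Node Buffer read with `fs.readFileSync`, so the backend needs an equivalent Buffer. A minimal sketch, assuming the client sends the file content as a base64 string: the hypothetical helper `base64ToPdfBuffer` below decodes it and checks for the `%PDF-` signature that PDF files normally start with, so a corrupted transfer fails early with a readable message.

```javascript
// Hypothetical backend helper (not part of pdf-parse): decode a base64
// string into a Node Buffer and check that it looks like a PDF.
function base64ToPdfBuffer(base64Content) {
  if (typeof base64Content !== "string" || base64Content.length === 0) {
    throw new TypeError("Expected a non-empty base64 string");
  }
  const buffer = Buffer.from(base64Content, "base64");
  // PDF files normally begin with the ASCII signature "%PDF-".
  if (buffer.subarray(0, 5).toString("latin1") !== "%PDF-") {
    throw new Error("Content does not start with the %PDF- signature");
  }
  return buffer;
}

// Assumed usage inside the backend function (requires the pdf-parse module):
// const pdf = require('pdf-parse');
// const data = await pdf(base64ToPdfBuffer(base64Content));
// return data.text;
```

Returning only `data.text` instead of the whole result object keeps the return value a plain string.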

Example link - https://www.npmjs.com/package/pdf-parse

I have been stuck here for more than a week and would really appreciate any help/suggestions. TIA

sejal-kacharia · March 7, 2023, 9:13pm

@russian-dima, or anyone else, do you have any suggestions?

CODE-NINJA · March 7, 2023, 9:21pm

When I find the time, I will surely take a look at it.
Like I already wrote here…

I had already anticipated all of these issues 😉

You should link all of your related posts to each other.

Tim_Baudouin · September 25, 2024, 10:47pm

Hey dude, I have the exact same project. Did you manage to get it working?
