Extract East Asian characters from PCT application

Hello,

I am looking for a tool to extract all the two-byte encoded characters from a PCT application in PDF format. With such an extraction tool, it would be possible to machine translate these applications before they enter their national phases. Does anyone have any good sources to share?

I know this subject is not related to the EPO, but I think this forum is a good place to share this kind of information.
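
Once the text of an application is available as a string, picking out the East Asian characters could look roughly like the sketch below. It is only an illustration: the function name `extract_east_asian_runs` is made up for this example, and the Unicode ranges are an assumed, incomplete selection (Hiragana, Katakana, the main CJK ideograph blocks and Hangul syllables).

```python
import re

# Unicode ranges treated as "East Asian" in this sketch.
# This selection is an assumption and is not a complete list.
_EAST_ASIAN = re.compile(
    "["
    "\u3040-\u309f"  # Hiragana
    "\u30a0-\u30ff"  # Katakana
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uac00-\ud7a3"  # Hangul syllables
    "]+"
)


def extract_east_asian_runs(text):
    """Return each contiguous run of East Asian characters found in text."""
    return _EAST_ASIAN.findall(text)
```

Each returned run could then be passed to a machine translation tool. Latin text, digits and punctuation are dropped, so the list would need extending if full-width punctuation should be kept.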

3 Comments

Yoochan writes on Jun 3, 2008 8:54:23 AM
(Institution/Organisation: Korea Institute of Patent Information)

Hello Roch,

This is Choi, YooChan from the Korea Institute of Patent Information.
It seems you are looking for software that can extract two-byte characters from PDF documents.

I found a program offered by Synapsoft, a software development company in Korea.
I tested this software with PDF documents written in Korean and Japanese, and it worked fine. It can also extract characters from Excel, DOC and PPT files, but not from image-based PDFs or other image files.

The download URL for the software is: http://www.synap.co.kr/next/docuinfo.jsp
It is freeware, so you can test it once you have installed it.

By the way, they don't have an English website, but you can find the download icon in the middle of the page, outlined in blue.

I hope this is useful to you.

Tsuyoshi writes on Jun 12, 2008 8:37:49 AM
(Institution/Organisation: Japan Patent Information Organization)

Hello, Roch,

This is Tsuyoshi Kakita of the Japan Patent Information Organization.
Thank you for your inquiry.

It seems that the PDF documents provided via the esp@cenet Worldwide database are converted from image files, not from text files. I guess such documents cannot be converted back to text format directly. The only way to convert them to text format is therefore to use OCR.

Unfortunately, I couldn't find any good OCR software that can be used free of charge. However, I did find a program called "XeloReader", which you can use for free for 180 days.

I tested this software by converting the PDF file of WO2008066105, a PCT document from the Worldwide database containing Japanese characters, and it worked fine.

The URL for downloading this software is as follows:
http://www.vector.co.jp/soft/dl/winnt/writing/se435636.html

The download icon is the blue one in the middle of the page.

When the installation is complete, a shortcut icon will appear on the desktop. Drop the PDF file onto it, and the file will open automatically.

You will find a lot of icons in the toolbar at the top of the window. The rightmost one is for OCR conversion (you can read "OCR" on the icon). Just click it, and the displayed file will be converted to a text-based PDF file. In my case, a new PDF file named "WO2008066105_OCR" was created on the desktop.

This new PDF file is text-based, and you can convert it to a text file just by opening it in Adobe Reader and saving it as a text file.
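
If the saved text file then has to be read by another program, its character encoding may not be obvious. As a rough sketch, the bytes could be tried against a list of candidate encodings in order. The helper name `decode_with_fallback` is hypothetical, and which encoding the saved file actually uses is not something I have verified, so the candidate list is an assumption the caller has to supply.

```python
def decode_with_fallback(data, encodings):
    """Return (text, encoding) for the first candidate that decodes data without error.

    This is a heuristic: a wrong encoding can sometimes decode without error
    and produce garbage, so list the most likely encodings first.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            # This candidate does not fit the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

For a Japanese document one might read the file in binary mode and call `decode_with_fallback(data, ["utf-8", "shift_jis", "euc_jp"])`. The result should still be checked by eye, since the function only reports the first encoding that does not fail.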

For your information, XeloReader is a product of Xelo, Inc.
Their English website is:
http://xelo.jp/eng/

I hope this information helps.
Best regards.

brandonliuhb writes on Jun 19, 2008 1:23:48 PM
(Institution/Organisation: Intellectual Property Publishing House, SIPO)

Chinese OCR software can convert image files to text files. The download address for the free evaluation version is http://down.x6x8.com/soft/softdown.asp?softid=245
