Example of use: OCR Text Recognition

Enhanced OCR Text Recognition

Text in a PDF file can be stored as native PDF text, as text broken down into lines, as text broken down into hatches, or as text contained in raster pictures.

To recognize non-native text, the program uses artificial intelligence methods: OCR (Optical Character Recognition) and Symbol Recognition.

Simple OCR Text Recognition recognizes horizontal or vertical non-native PDF text.

Enhanced OCR Text Recognition recognizes non-native PDF text in construction plans that contain text at different slopes. The slope of each text is defined using a special editor.

OCR Parameter: Primary and Secondary Text Slope

You can select one or two simultaneous text directions. Do not select the combinations "vertical with vertically from above" or "horizontal with upside down"; in each of these pairs the two directions differ by 180 degrees.

The "Primary Text" is the text direction that occurs most often in the converted drawing. For example, if the drawing is rotated by 90 degrees so that most texts are vertical and only a few are "upside down", then the primary text is the text at a slope of 90 degrees.
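The selection rules above can be sketched in a few lines. This is only an illustration, not Print2CAD's implementation: the names `SLOPES`, `is_allowed_pair`, and `primary_direction` are hypothetical, and the slope values are the ones listed for the four text directions later in this page.

```python
from collections import Counter

# Slope in degrees for each text direction (values taken from this page).
SLOPES = {
    "horizontal": 0,
    "vertical": 90,
    "upside down": 180,
    "vertically from the top": 270,
}

def is_allowed_pair(primary: str, secondary: str) -> bool:
    """Return False for pairs whose slopes differ by exactly 180 degrees."""
    return (SLOPES[primary] - SLOPES[secondary]) % 360 != 180

def primary_direction(observed: list[str]) -> str:
    """Pick the direction that occurs most often in a list of observed texts."""
    if not observed:
        raise ValueError("no observed text directions")
    return Counter(observed).most_common(1)[0][0]
```

For a drawing with five vertical texts and two upside-down texts, `primary_direction` returns "vertical", and the pair "vertical" with "upside down" passes `is_allowed_pair`.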

OCR Parameter: Text Language

Selecting the correct text language helps the program assemble correct words. Print2CAD uses artificial intelligence methods to check the text and an internal dictionary to eliminate unusual text combinations.

Text Representation

Selecting the right text representation is very important for correct text recognition.

The text for OCR text recognition can be stored in the PDF as native text, as text broken down into lines or paths, as text broken down into hatches, or as pixel pictures that contain text.

Run the analysis of the PDF file before activating a text representation. The analysis shows, in separate pictures, which kinds of text representation are used in the input PDF file.

If you find more than one text representation, choose all of them.

For correct OCR, the line weight of the text must be between 10% and 20% of the text height. If the line weight of the text is smaller or larger, change the minimum or maximum line weight using the interface function.
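As a quick illustration of the 10% to 20% rule, the following sketch classifies a line weight relative to the text height. The function name `line_weight_status` and its return strings are hypothetical; both values only need to be in the same unit.

```python
def line_weight_status(line_weight: float, text_height: float,
                       low: float = 0.10, high: float = 0.20) -> str:
    """Classify the line weight as a fraction of the text height.

    Returns "too thin", "ok", or "too thick" for the 10%-20% window.
    """
    if text_height <= 0:
        raise ValueError("text_height must be positive")
    ratio = line_weight / text_height
    if ratio < low:
        return "too thin"
    if ratio > high:
        return "too thick"
    return "ok"
```

A result of "too thin" or "too thick" corresponds to the case in which the minimum or maximum line weight should be adjusted.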

PDF Text Presented by Raster Pictures

PDF Text Presented by Boundary Hatches

PDF Text Presented by Polylines

OCR Parameter: Picture Resolution in DPI

The right resolution is very important for OCR text recognition. The resolution should be as low as possible while the text remains clear and readable. Start with 300 DPI and push the button “Preview”. If the smallest text is not readable, increase the resolution in steps of 50 DPI.
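The search for the lowest usable resolution can be sketched as follows. This is an assumption-laden illustration: it assumes the smallest text height is known in PDF points (1/72 inch, the default PDF unit), and `min_readable_px` is a hypothetical readability limit that you would judge in the preview, not a value documented by Print2CAD.

```python
def text_height_px(height_pt: float, dpi: float) -> float:
    """Convert a text height in PDF points (1/72 inch) to pixels at a given DPI."""
    return height_pt / 72.0 * dpi

def lowest_usable_dpi(smallest_text_pt: float, min_readable_px: float,
                      start_dpi: int = 300, step: int = 50,
                      max_dpi: int = 1200) -> int:
    """Start at 300 DPI and go up in 50 DPI steps until the smallest text
    reaches the (assumed) minimum readable height in pixels."""
    dpi = start_dpi
    while dpi <= max_dpi:
        if text_height_px(smallest_text_pt, dpi) >= min_readable_px:
            return dpi
        dpi += step
    raise ValueError("text stays below the readable height up to max_dpi")
```

With a smallest text of 4 points and an assumed limit of 20 pixels, the loop rejects 300 and 350 DPI and stops at 400 DPI.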

OCR Parameter: Minimum and Maximum Text Height in Pixel

The parameters for maximum and minimum text height are very important because the preseparation of the text is based on them. Push the button “Preview”. If you see that not all text is separated, increase the maximum height. If you see that many stray pixels are separated, increase the minimum height.
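The effect of the two limits can be pictured with a simple height filter. This is a minimal sketch under the assumption that preseparation compares the pixel height of each candidate element against both limits; the function name `preseparate_by_height` is hypothetical, and the program's actual preseparation is not described here.

```python
def preseparate_by_height(component_heights_px, min_height_px, max_height_px):
    """Sort candidate heights into (kept, too_small, too_large).

    too_small: likely stray pixels; too_large: text or graphics above the limit.
    """
    kept, too_small, too_large = [], [], []
    for height in component_heights_px:
        if height < min_height_px:
            too_small.append(height)
        elif height > max_height_px:
            too_large.append(height)
        else:
            kept.append(height)
    return kept, too_small, too_large
```

If real text ends up in the third list, the maximum height is too low; if stray pixels end up in the first list, the minimum height is too low.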

OCR Parameter: Image Threshold

If you choose raster images as the text representation, the threshold decides which pixels belong to the black text and which pixels belong to the white background. Push the button “Preview”. If you see that the letters connect to each other, decrease the threshold.

Raster Image with Gray Scale Intensity

After applying a Threshold of 120
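Thresholding of a gray scale image can be sketched as below. The convention used here is an assumption: intensities below the threshold become black (0) and all others become white (255), which matches the advice that a lower threshold produces less black and separates touching letters. The function name `binarize` is hypothetical.

```python
def binarize(gray_rows, threshold=120):
    """Map each gray value (0-255) to black (0) or white (255).

    Assumed convention: value < threshold -> black, otherwise white.
    """
    return [[0 if value < threshold else 255 for value in row]
            for row in gray_rows]

def count_black(binary_rows):
    """Count black pixels in a binarized image."""
    return sum(value == 0 for row in binary_rows for value in row)
```

Lowering the threshold from 120 to 100 on the same image can only keep or reduce the number of black pixels, which is why it helps when letters run together.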

Enhanced OCR of Text: Text Areas

OCR text recognition only works if the correct text direction can be detected. Unfortunately, a single construction plan can contain text in many different directions.

Good OCR text recognition therefore requires a manual preseparation of the text into areas with a common direction.

Print2CAD offers a special editor for these text areas.

Each “Text Area” is defined by three points. The first two points give the text direction, and the third point gives the upper right corner of the text box.
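The geometry of such a three-point definition can be sketched as follows. The sketch assumes a coordinate system with the y axis pointing up, takes the first point as the lower left corner of the box, and measures the slope counterclockwise; the function name `text_area_from_points` is hypothetical and not part of the program.

```python
import math

def text_area_from_points(p1, p2, p3):
    """Return (slope_degrees, corners) for a text box.

    p1 -> p2 gives the text direction; p3 is the upper right corner.
    Corners are listed as lower left, lower right, upper right, upper left.
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("the first two points must differ")
    ux, uy = dx / length, dy / length      # unit vector along the text
    nx, ny = -uy, ux                       # unit vector toward the text top
    vx, vy = p3[0] - p1[0], p3[1] - p1[1]
    width = vx * ux + vy * uy              # extent along the text direction
    height = vx * nx + vy * ny             # extent perpendicular to it
    slope = math.degrees(math.atan2(dy, dx)) % 360
    corners = [
        (p1[0], p1[1]),
        (p1[0] + width * ux, p1[1] + width * uy),
        (p1[0] + width * ux + height * nx, p1[1] + width * uy + height * ny),
        (p1[0] + height * nx, p1[1] + height * ny),
    ]
    return slope, corners
```

By construction, the third corner returned is the point given as `p3`, and the slope follows from the first two points alone.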

In the text area editor you can choose different boxes for “Text Area” and for “Number Area”.

“Text Area” recognizes letters, numbers, and special characters like “+”, “-” etc. If a character could be either a letter or a number (like the letter “l” and the number “1”), the recognition chooses the letter “l”.

“Number Area” recognizes numbers, letters, and special characters like “+”, “-” etc. If a character could be either a number or a letter (like the letter “l” and the number “1”), the recognition chooses the number “1”.
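The difference between the two box types comes down to a tie-breaking rule, which can be sketched like this. The function name `resolve_ambiguous` and the area kinds "text" and "number" are hypothetical labels for illustration; the actual recognizer is not described here.

```python
def resolve_ambiguous(candidates, area_kind):
    """Pick one character from a set of look-alike candidates.

    area_kind "text" prefers letters, "number" prefers digits.
    """
    letters = sorted(c for c in candidates if c.isalpha())
    digits = sorted(c for c in candidates if c.isdigit())
    if area_kind == "text":
        preferred = letters or digits
    elif area_kind == "number":
        preferred = digits or letters
    else:
        raise ValueError("area_kind must be 'text' or 'number'")
    if not preferred:
        raise ValueError("no letter or digit among the candidates")
    return preferred[0]
```

For the candidates "l" and "1", a text box yields "l" and a number box yields "1".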

Tips

  • Try to separate numbers and letters into different text areas.
  • Try to include in one text area only text with the same or a similar text height.
  • Try to define clean text areas that are not disturbed by other drawing elements.

Pre-Generation of Text Areas

OCR text recognition works better if the text areas are specified precisely. Print2CAD can generate the text areas automatically.

A generated text area can have one of four directions:

  • horizontal (text slope 0 degrees)
  • vertical (text slope 90 degrees)
  • upside down (text slope 180 degrees)
  • vertically from the top (text slope 270 degrees)

The automatically generated text areas can be corrected, selectively deleted, or extended by activating "Enhanced OCR text recognition" and using the text area editor.

Preview

With "Preview", you can inspect the automatically generated text areas. If corrections are necessary, activate "Enhanced OCR text recognition".