Optimization 3:
Native PDF Text Recognition

Recognition of Native PDF Text

  • Text in PDF files can be stored as native PDF text, as text broken down into lines, as text broken down into hatches, or as text rendered in raster images.
  • To recognize these kinds of text, the program uses artificial intelligence methods: OCR (Optical Character Recognition) and Symbol Recognition.

Conversion of Native PDF texts

  • Native text in PDF files can be stored as strings or as individual characters. The best way to find out whether your PDF file contains real text is to analyze it with the analysis function of Print2CAD and check whether any text entities are reported.
  • Another method is to open the PDF file in a PDF reader and zoom in on the text as far as possible. If the letters still have smooth edges, your PDF file most likely contains real text. If the edges of the letters are not smooth, Print2CAD will not convert the “text” to real text unless the OCR function is activated.

Parameter: Convert Native PDF Texts into Editable CAD Texts

  • In PDF files, text is usually defined as separate characters or groups of characters, each with its own insertion point. Using special internal methods, Print2CAD merges these characters into strings and places the strings as text in the DWG or DXF drawing.
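
The merging step described above can be illustrated with a minimal Python sketch. It is not Print2CAD's internal method, which is not documented here. The `Glyph` structure, the function name `merge_glyphs`, and both tolerance values are assumptions chosen for illustration: characters are joined when they sit on the same baseline and follow each other without a noticeable gap.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One character with its own insertion point (hypothetical structure)."""
    char: str
    x: float      # insertion point, left edge
    y: float      # baseline
    width: float  # advance width of the character


def merge_glyphs(glyphs, y_tol=0.5, gap_tol=0.3):
    """Merge glyphs that share a baseline and follow each other closely.

    Returns a list of (x, y, text) tuples, one per merged string.
    y_tol and gap_tol are illustrative thresholds in drawing units.
    """
    strings = []
    current = None  # [x, y, text, right_edge] of the string being built
    # Sort top to bottom by (quantized) baseline, then left to right.
    ordered = sorted(glyphs, key=lambda g: (-round(g.y / y_tol), g.x))
    for g in ordered:
        same_line = current is not None and abs(g.y - current[1]) <= y_tol
        adjacent = same_line and abs(g.x - current[3]) <= gap_tol
        if adjacent:
            # Extend the current string and move its right edge.
            current[2] += g.char
            current[3] = g.x + g.width
        else:
            # Close the current string and start a new one at this glyph.
            if current is not None:
                strings.append((current[0], current[1], current[2]))
            current = [g.x, g.y, g.char, g.x + g.width]
    if current is not None:
        strings.append((current[0], current[1], current[2]))
    return strings
```

Each resulting tuple carries the insertion point of its first character, which is where a merged text entity would be placed in the drawing.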

Parameter: Convert Native PDF Texts into Hatches

  • Print2CAD converts all text into polylines with filled areas (hatches).
  • It is not always possible to extract text from a PDF, especially when the Unicode map is missing or “user defined”. Many construction drawings use this kind of trick to stop people from extracting the data.
  • If you cannot copy and paste the correct text from Acrobat, you have very little chance of converting the text yourself. If Acrobat cannot extract it, it is very unlikely that Print2CAD can extract the text correctly.
  • The only ways to handle this kind of text are to convert it into hatches or to apply the OCR functions to it.
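
The copy-and-paste test above can be roughly automated. The following Python sketch is a heuristic assumed for illustration only; it does not describe how Print2CAD or Acrobat work. It relies on the assumption that text copied from a PDF without a usable Unicode map may contain Private Use Area code points, replacement characters, or stray control characters. The function name `looks_unmapped` and the threshold are hypothetical.

```python
def looks_unmapped(text, threshold=0.2):
    """Return True if a large share of characters suggests a missing Unicode map.

    Counts Private Use Area code points (U+E000-U+F8FF), the replacement
    character U+FFFD, and control characters other than common whitespace.
    The threshold is an illustrative value, not a Print2CAD setting.
    """
    if not text:
        return False
    suspicious = 0
    for ch in text:
        cp = ord(ch)
        if 0xE000 <= cp <= 0xF8FF or cp == 0xFFFD:
            suspicious += 1
        elif cp < 0x20 and ch not in "\t\n\r":
            suspicious += 1
    return suspicious / len(text) >= threshold
```

If the check returns True for text pasted from the PDF, conversion into hatches or OCR is the more promising route. A result of False does not prove that the text is correct, because a wrong map can also produce ordinary-looking but incorrect letters.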

Parameter: Visualization of a Text with Corrupt Codec

  • If the font codec and encoding table were created manually, Print2CAD uses artificial intelligence methods to determine the correct character codes.

Parameter: Sort Text Onto Separate Layer

  • When this function is activated, all native or recognized text is sorted onto a predetermined layer. If there is no real text, but only polylines, hatches, or raster images, the letters will not be recognized as text.

Parameter: Replace All Fonts With a SHX or TTF Font

  • When this option is enabled, all text styles are assigned the same selected SHX or TTF font.

Parameter: Scale Factors for Blank Space Width

  • Text in PDF files is often placed as single letters. In this case, the PDF contains no space characters.
  • When Print2CAD transforms letters into text, blank spaces are detected with the help of a substitute space width equal to the width of the letter “a”.
  • If the space detection does not work properly, increase or reduce the substitute space factor by trial and error, according to the graphic below.
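
The effect of the substitute space factor can be sketched in Python. This is a simplified model under stated assumptions, not Print2CAD's actual algorithm: the function name `insert_spaces` is hypothetical, and the rule that a blank is inserted when the gap between two letters is at least the scaled width of “a” is an assumption about how the comparison is made.

```python
def insert_spaces(letters, a_width, space_factor=1.0):
    """Join single letters on one line, inserting a blank where the gap is wide.

    letters: list of (char, x, width) tuples sorted left to right.
    a_width: width of the letter "a" in the current font and size.
    space_factor: scale factor applied to the substitute space width.
    """
    space_width = a_width * space_factor  # substitute space width
    out = []
    prev_right = None  # right edge of the previous letter
    for char, x, width in letters:
        # Assumed rule: a gap at least as wide as the substitute space
        # width counts as a blank.
        if prev_right is not None and (x - prev_right) >= space_width:
            out.append(" ")
        out.append(char)
        prev_right = x + width
    return "".join(out)
```

In this model, a factor that is too large swallows real blanks and runs words together, while a factor that is too small splits words into single letters. This is why the factor has to be tuned by trial and error for a given drawing.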

Parameter: Scale Factors for Text Width and Height

  • If Print2CAD cannot find the fonts used in the PDF in the Windows system, it selects a similar font. As a result, the text width may change.
  • A workaround is to use scale factors for the text width and height. The text is scaled by the given factor and placed left-aligned in the CAD drawing.
  • Fonts in PDF files are usually embedded, so you do not need them installed in your Windows system to display the PDF files.
  • In DWG or DXF files, fonts cannot be embedded. All fonts used in the DWG or DXF files must be installed in your Windows system.
  • Print2CAD is not able to extract fonts embedded in a PDF and install them in your Windows system.
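
The width scale factor described in this section can be estimated with simple arithmetic. The Python sketch below assumes that you can measure the width of the same text once in the PDF and once in the CAD drawing with the substitute font. Both function names are hypothetical and are not Print2CAD settings.

```python
def width_scale_factor(pdf_text_width, substitute_text_width):
    """Factor that makes text in the substitute font as wide as in the PDF."""
    if substitute_text_width <= 0:
        raise ValueError("substitute_text_width must be positive")
    return pdf_text_width / substitute_text_width


def scaled_extent(insert_x, text_width, width_factor):
    """Left-aligned scaling: the insertion point stays, the right edge moves.

    Returns the (left, right) x-coordinates of the scaled text.
    """
    return insert_x, insert_x + text_width * width_factor
```

For example, if a label is 30 units wide in the PDF but 60 units wide in the substitute font, the factor is 0.5. Text inserted at x = 10 then runs from 10 to 40 instead of from 10 to 70.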