History: OCR Indexing

Source of version: 17 (current)

! {{page}}
Since ((Tiki20)), ((file galleries)) can extract text from uploaded files that contain images, by means of "Optical Character Recognition" (OCR), and feed the result into the ((search index)).


Tiki relies on Tesseract (https://github.com/tesseract-ocr/tesseract), so you need to install it as described at https://tesseract-ocr.github.io/tessdoc/Installation.html

If you are using WikiSuite, Tesseract is installed by default: https://wikisuite.org/Differences-between-Virtualmin-and-WikiSuite

((Server Check)) helps you confirm that Tesseract is working well, and available to Tiki.

!! Required Preferences
To enable OCR indexing in Tiki, make sure to activate the following preference:
''ocr_enable'': Enables Tiki to extract and index text from supported file types.

!! Optional Preferences
You can further customize OCR behavior with these optional settings:
''ocr_every_file'': If enabled, Tiki will attempt OCR on all supported files, regardless of other criteria.
''ocr_file_level'': Allows users to override the default OCR language settings on a per-file basis.

!! Additional Customization
Tiki also offers several advanced customization options:
* Display OCR status per file.
* Set custom paths for the tesseract and pdfimages binaries via the system $PATH.
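Before setting custom paths, it can help to see whether the two binaries are already resolvable through $PATH. The following is a minimal sketch, not part of Tiki: the function name ''find_binaries'' is hypothetical, and it checks the $PATH of the user running the script, which may differ from the $PATH seen by the web server user.

```python
import shutil

# Binary names as mentioned on this page; adjust if your system names them differently.
REQUIRED_BINARIES = ("tesseract", "pdfimages")


def find_binaries(names=REQUIRED_BINARIES):
    """Map each binary name to its resolved path, or None if it is not on $PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_binaries().items():
        print(f"{name}: {path or 'NOT FOUND on $PATH'}")
```

If a binary is reported as not found, either install it or point Tiki to its location with the custom path settings. ((Server Check)) remains the authoritative check for what Tiki itself can see.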

((File Gallery)) has two view modes: "Finder view" and the default "List view". The OCR status of files is shown only in the default "List view".

{img src="display2186" link="display2186" width="400" rel="box[g]" imalign="center" desc="This is where you can switch between File Galleries modes — from the default list view to Finder view, and vice versa." align="center" styleimage="border"}

!! Understanding OCR Status in File Galleries
When OCR is enabled, Tiki displays the processing status of each file directly in the File Galleries list. This helps users know whether a file’s text content has been indexed and is searchable.

The image below shows a few OCR statuses in action:

{img src="display2184" link="display2184" width="500" imalign="left" }

(1) __No scheduled processing__:
This status means the file hasn’t been marked for OCR yet. This could be because the file type isn’t supported, or because the file hasn’t met the criteria for OCR processing.

(2) __Queued for processing__:
The file has been added to the OCR queue and will be processed shortly. Once done, the status will change accordingly.

Other possible statuses include:
__Finished processing__
→ OCR has successfully extracted text and indexed the file. The file is now searchable by its content.

__Currently processing__
→ Tiki is actively running OCR on this file.

__Processing stalled__
→ An error occurred or the process is stuck. This may require admin review.

!! Activate a scheduler to OCR all your files in the queue

* In the Control Panels, go to Server → Scheduler (your_tiki/tiki-admin_schedulers.php) and enable the scheduler feature if it isn’t already active.
* Click “Add a new scheduler”, choose Console command as the task type, and set the Command field to {CODE()}ocr:all{CODE}. Give it a clear name such as “OCR Queue Processor”, set the status to Active, and pick a run frequency that fits your site (for example, daily at 02:00).

To verify, run __php console.php ocr:all__ once manually from the Tiki root directory; pending files should change to “Finished processing”. You can also open Server Check (your_tiki/tiki-check.php) → OCR Status to confirm that the scheduler warning has disappeared and that OCR jobs now run automatically.
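
The manual run can also be wrapped in a small script, for example to trigger it from a deployment or maintenance job. This is only a sketch under assumptions: the helper names ''build_ocr_command'' and ''run_ocr_queue'' are hypothetical, the path /var/www/tiki is a placeholder for your Tiki root, and ''php'' is assumed to be on $PATH. The only Tiki-specific part is the __php console.php ocr:all__ command itself.

```python
import subprocess
from pathlib import Path


def build_ocr_command(tiki_root, php_binary="php"):
    """Build the argument list for the manual OCR run: php console.php ocr:all."""
    return [php_binary, str(Path(tiki_root) / "console.php"), "ocr:all"]


def run_ocr_queue(tiki_root, php_binary="php"):
    """Run the OCR queue once from the Tiki root and return the process exit code."""
    console = Path(tiki_root) / "console.php"
    if not console.is_file():
        raise FileNotFoundError(f"console.php not found in {tiki_root}")
    completed = subprocess.run(
        build_ocr_command(tiki_root, php_binary),
        cwd=tiki_root,   # run from the Tiki root, as with the manual command
        check=False,     # let the caller decide how to handle a non-zero exit code
    )
    return completed.returncode


if __name__ == "__main__":
    # Placeholder path: replace with the directory that contains console.php.
    print("exit code:", run_ocr_queue("/var/www/tiki"))
```

Checking for console.php first gives a clearer error than a failed PHP call when the path is wrong. After a run, the File Galleries list view is still the place to confirm that the statuses have changed.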


Alias names for this page:
(alias(OCR)) | (alias(OCRIndexing)) | (alias(Optical Character Recognition))