cattaSearch backend

In the cattaSearch backend you define your website(s), crawl a website, and view a log of your website crawls.

The cattaSearch backend consists of three parts:

  1. Define website
  2. Crawl website
  3. Crawl log

 

In the first part of the backend you define the websites to be searched. For each website you enter:

  • The operating system path where the crawl starts - required
  • The web path, needed only if the crawl and search are confined to part of a website; otherwise leave it blank - optional
  • A directory blacklist, i.e. a list of directories NOT to be included in the crawl and search - optional
  • An extension blacklist, i.e. a list of file types NOT to be included in the crawl and search - optional
  • A unique name for the database table that stores the search repository - required

You can define as many websites as you like. The search data is stored in a separate database table for each website.

A website defined in cattaSearch can also be just one part of your actual website. In that case you must also enter the web path. For example, if your website has a photo section that starts in the directory /photos under the root of your website, and you want a dedicated search for that section, enter '/photos' as the web path.
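
As an illustration only, the settings of one website definition could be modelled as follows. The class name, field names, and example paths are hypothetical and are not taken from cattaSearch itself; the sketch simply mirrors the required and optional settings described above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WebsiteDefinition:
    """Hypothetical model of one website definition (all names are assumptions)."""

    os_path: str            # required: operating system path where the crawl starts
    table_name: str         # required: unique name of the database table
    web_path: str = ""      # optional: blank unless only part of the site is searched
    directory_blacklist: list[str] = field(default_factory=list)  # optional
    extension_blacklist: list[str] = field(default_factory=list)  # optional

    def validate(self) -> None:
        """Raise ValueError if one of the two required settings is missing."""
        if not self.os_path:
            raise ValueError("the operating system path is required")
        if not self.table_name:
            raise ValueError("the database table name is required")


# The photo example from the text: a search confined to the /photos section.
photos = WebsiteDefinition(
    os_path="/var/www/example/photos",  # hypothetical path
    table_name="photos_search",         # hypothetical table name
    web_path="/photos",
)
photos.validate()
```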

You can edit websites you have already defined, and delete them when they are no longer needed.

 

In the second part of the backend you run the actual crawl for a defined website. The crawl does the following:

  • It crawls your website, starting from its web path and working down through all subdirectories
  • For every file of a known file type that is not excluded by one of the blacklists, it extracts the plain text elements
  • It stores the search data in the MySQL database, where it is indexed for fast searching
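
The file selection step above can be sketched roughly as follows. This is not cattaSearch's own code: the function name and parameters are hypothetical, and the sketch assumes that the directory blacklist holds plain directory names and that the extension blacklist holds extensions with a leading dot.

```python
import os

# File types taken from the list of supported document formats.
KNOWN_EXTENSIONS = (".html", ".php", ".txt", ".pdf", ".docx", ".doc", ".odt")


def find_crawlable_files(start_path, directory_blacklist=(), extension_blacklist=()):
    """Yield the paths of all files below start_path that a crawl would index."""
    blocked_dirs = set(directory_blacklist)
    blocked_exts = {ext.lower() for ext in extension_blacklist}
    for current_dir, subdirs, filenames in os.walk(start_path):
        # Prune blacklisted directories in place so os.walk never descends into them.
        subdirs[:] = [d for d in subdirs if d not in blocked_dirs]
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            # Keep only known file types that are not on the extension blacklist.
            if ext in KNOWN_EXTENSIONS and ext not in blocked_exts:
                yield os.path.join(current_dir, name)
```

Called as `find_crawlable_files("/var/www/example", ["private"], [".txt"])`, it would skip every directory named `private` and every `.txt` file; the path is again only an example.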

You should crawl your website every time you make changes to it. Each crawl deletes all the old search data and replaces it with new search data.
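
The delete-and-replace behaviour can be illustrated with a minimal sketch. Note the assumptions: it uses Python's built-in SQLite module as a stand-in for MySQL so that it runs without a database server, and the function name, table layout, and column names are hypothetical. It does not show cattaSearch's real schema or how its index is built.

```python
import sqlite3


def replace_search_data(conn, table_name, documents):
    """Delete all rows in table_name, then insert documents as (path, content) pairs."""
    # Table names cannot be bound as SQL parameters, so accept plain identifiers only.
    if not table_name.isidentifier():
        raise ValueError("unexpected table name")
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table_name} "
            "(path TEXT PRIMARY KEY, content TEXT)"
        )
        conn.execute(f"DELETE FROM {table_name}")  # drop the old search data
        conn.executemany(
            f"INSERT INTO {table_name} (path, content) VALUES (?, ?)", documents
        )
```

Running the delete and the inserts in one transaction means a failed crawl would leave the previous search data in place, which is a design choice of this sketch rather than a documented property of cattaSearch.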

The following document formats are currently supported for text search by cattaSearch:

  • Web pages (.html and .php)
  • Plain text files (.txt)
  • PDF documents
  • Newer Microsoft Word documents (.docx)
  • Older Microsoft Word documents (.doc)
  • OpenOffice/LibreOffice documents (.odt)

Please note that pdftotext, the package needed for extracting text from PDF documents, is not included in the cattaSearch download file. It is specific to your operating system, i.e. you have to download and install it yourself.
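
To check that pdftotext is installed and reachable, a small script along the following lines may help. How cattaSearch itself calls pdftotext is not documented here, so the function names and the chosen options are assumptions; the sketch relies only on the standard pdftotext convention that "-" as the output file means standard output, and on its -enc option for the output encoding.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path):
    """Return the argument list that converts pdf_path to text on standard output."""
    # "-enc UTF-8" requests UTF-8 output; the final "-" sends the text to stdout.
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_pdf_text(pdf_path):
    """Return the plain text of a PDF, or raise RuntimeError if pdftotext is missing."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext is not installed or not on the PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```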

 

The third and last part of the cattaSearch backend is the website crawl log, where you can see all your crawls with date/time, crawl time in seconds, and the number of files found.
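
As a rough illustration of what one log entry holds, the three values could be captured like this. The class, its field names, and the helper function are hypothetical and do not describe how cattaSearch stores its log.

```python
import time
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CrawlLogEntry:
    """Hypothetical record for one crawl (field names are assumptions)."""

    started_at: datetime       # date/time of the crawl
    duration_seconds: float    # crawl time in seconds
    files_found: int           # number of files found


def run_logged_crawl(crawl):
    """Run crawl(), which returns an iterable of files, and return a log entry."""
    started_at = datetime.now()
    start = time.perf_counter()
    files = list(crawl())
    duration = time.perf_counter() - start
    return CrawlLogEntry(started_at, round(duration, 2), len(files))
```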

 

Revised: 2020-05-24