Writing a simple search engine in PHP

Writing a search engine for your LAMP server is actually fairly simple. I realized I could simply use grep for the search algorithm and have the search engine use regular expression metacharacters. I mean, why reinvent the wheel? Of course, a search engine that crawls the entire web will be more complicated; this one just searches the local web server.

Here is the code I’ve written. It uses less than ten lines of PHP code, and even what I have here has some parts that are rather superfluous (I should probably streamline it)…


 1 <!DOCTYPE html>
 2 <!-- A simple search engine written in PHP -->
 3 
 4 <html>
 5 <head>
 6 <title>Search</title>
 7 </head>
 8 
 9 <body>
10 <?php
11 if( is_null( $_GET['query'] ) ){
12 ?>
13 <!-- No query performed yet -->
14 <form action="<?php echo $_SERVER['PHP_SELF']?>" method="get">
15 Enter a query:<br>
16 <input name="query" type="text"><br>
17 <input type="submit" value="Search">
18 </form>
19 <?php } else { ?>
20 <!-- Query has been performed -->
21 <?php
22 exec( "grep -ril \". $_GET['query'] . "\" *", $files );
23 $length = sizeof( $files );
24 for( $i = 0$i < $length$i++ ){
25         echo( "<a href=\". $files[$i] . "\">. $files[$i] . "</a><br>\n);
26 }
27 ?>
28 <?php } ?>
29 </body>
30 </html>

Since the search algorithm is simply a front-end for grep, I didn’t really have to think about its implementation. Basically, the script has a decision statement that looks at the superglobal variable $_GET['query']. If it’s null, that means the query hasn’t been submitted yet, so it shows the prompt for the query. If it’s not null, it shows the results of the query, which is of course a regular expression. The results are obtained by greping the local server filesystem and returning all files that contain that pattern.

One thing that makes PHP code somewhat confusing is the way you can have a PHP script interleaved with HTML code. It’s one of those things you just have to get used to.

Possible enhancements include:

  • Using egrep instead of grep so the user can use extended regular expressions.
  • Adding the ability to search for images, videos, and other media on the server, rather than just web pages (this could be done by returning any such media that are used by pages that match the pattern).
  • Searching filenames in addition to file contents (you could use find for this).
  • Using a ranking algorithm (currently it just lists them in the canonical order returned by grep).
  • And of course adding CSS and other formatting to make the page more aesthetically pleasing.
Advertisements

Setting up an Apache HTTP server

Through some config file hacking, I have managed to set up an Apache HTTP server on my Macbook.  I did this so that I could test the full functionality of PHP.  Since PHP is one of the top most needed skills for freelance coding jobs, I figured it would be a good idea to learn it, and of course to use any of the features of PHP beyond just the core language, you need a web server.

Starting the Apache server is pretty easy.  All you have to do is type sudo httpd at the command line (assuming Apache is installed on your system, which I think it is for most Unix-based systems). It is recommended that you use apachectl as a frontend instead of using httpd directly, but I couldn’t seem to get this to work, so to start Apache I use sudo httpd and to stop it I use sudo killall httpd.

Now configuring the server to use PHP was somewhat more difficult, though still not too much so. First of all, for a server to use PHP, the PHP DOS initialization file needs to be present as /usr/local/lib/php.ini. After some digging around, I found the PHP ini file at etc/php.ini.default, so I just copied it (changing the filename of course).

The next thing I had to do was tell Apache to load the PHP module at startup. This is done by editing the file /etc/apache2/httpd.conf and uncommenting the appropriate code line.  It must be remembered that editing this file requires root privileges.

httpd.conf-uncomment

The appropriate line is


LoadModule php5_module libexec/apache2/libphp5.so

…Shown here already uncommented.

The next thing you have to do is find out what directory Apache is using to serve files to clients. This is determined by the DocumentRoot environment variable, and controlled by a <Directory> tag.

http.conf-root

Here we see that the server’s filesystem is rooted at /Library/WebServer/Documents. Of course this is Mac-specific, and the root will be different on other systems, and we can also change it, though I felt no need to.

If you title a document “index.html”, “index.php”, etc. then this will be the file that the client goes to when the user types your domain name without appending a path at the end. Also, if you title a document, say, my-pictures.html, the file extension can be omitted in the URL.