Writing a simple search engine in PHP

Writing a search engine for your LAMP server is actually fairly simple. I realized I could simply use grep for the search algorithm and have the search engine use regular expression metacharacters. I mean, why reinvent the wheel? Of course, a search engine that crawls the entire web will be more complicated; this one just searches the local web server.

Here is the code I’ve written. It uses fewer than ten lines of PHP code, and even what I have here has some parts that are rather superfluous (I should probably streamline it)…

    <!DOCTYPE html>
    <!-- A simple search engine written in PHP -->
    <html>
    <head>
    <title>Search</title>
    </head>
    <body>
    <?php
    // isset() avoids an undefined-index warning on the first page load
    if( !isset( $_GET['query'] ) ){
    ?>
    <!-- No query performed yet -->
    <form action="<?php echo htmlspecialchars( $_SERVER['PHP_SELF'] ); ?>" method="get">
    Enter a query:<br>
    <input name="query" type="text"><br>
    <input type="submit" value="Search">
    </form>
    <?php } else { ?>
    <!-- Query has been performed -->
    <?php
    // escapeshellarg() keeps the shell from interpreting the query;
    // -e marks the next argument as the pattern, even if it starts with "-"
    exec( "grep -ril -e " . escapeshellarg( $_GET['query'] ) . " *", $files );
    $length = count( $files );
    for( $i = 0; $i < $length; $i++ ){
            // Escape each filename before echoing it into the HTML
            echo( "<a href=\"" . htmlspecialchars( $files[$i] ) . "\">" . htmlspecialchars( $files[$i] ) . "</a><br>\n" );
    }
    ?>
    <?php } ?>
    </body>
    </html>

Since the search algorithm is simply a front-end for grep, I didn’t really have to think about its implementation. Basically, the script has a decision statement that looks at the superglobal variable $_GET['query']. If it isn’t set, the query hasn’t been submitted yet, so the script shows the prompt for the query. Otherwise it shows the results of the query, which grep treats as a regular expression. The results are obtained by grepping the files in the script’s working directory (recursively and case-insensitively) and returning every file that contains a match for the pattern.
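
To make the grep step a bit more concrete, here is a minimal sketch of how a query could be turned into a command line. The function names build_grep_command and search_files, and the $dir parameter, are my own inventions for illustration. The important part is escapeshellarg(), which quotes the query so the shell passes it to grep as a single argument instead of interpreting it.

```php
<?php
// Hypothetical helper (name and signature are assumptions for illustration).
// Builds the grep command line for a user-supplied pattern.
function build_grep_command( string $query, string $dir = '.' ): string {
    // -r: recurse into subdirectories
    // -i: ignore case
    // -l: print only the names of matching files
    // -e: treat the next argument as the pattern, even if it starts with "-"
    return 'grep -ril -e ' . escapeshellarg( $query ) . ' ' . escapeshellarg( $dir );
}

// Hypothetical wrapper: runs the command and returns the matching filenames.
function search_files( string $query, string $dir = '.' ): array {
    $files = [];
    exec( build_grep_command( $query, $dir ), $files );
    return $files;
}
```

A call such as search_files( $_GET['query'] ) could then stand in for a bare exec() call in the page.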

One thing that makes PHP code somewhat confusing is the way you can have a PHP script interleaved with HTML code. It’s one of those things you just have to get used to.
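
If the braces scattered between HTML fragments bother you, PHP also has an alternative syntax for control structures (if: / else: / endif;) that some people find easier to follow in templates. The following is only a sketch: render_search_page is a hypothetical function, and the markup is a cut-down stand-in for the real page.

```php
<?php
// Hypothetical function: renders either the query form or a results line.
function render_search_page( ?string $query ): string {
    ob_start();  // capture the interleaved HTML instead of printing it
    if ( $query === null ): ?>
<form method="get">
<input name="query" type="text">
<input type="submit" value="Search">
</form>
    <?php else: ?>
<p>Results for <?php echo htmlspecialchars( $query ); ?></p>
    <?php endif;
    return ob_get_clean();  // return everything emitted since ob_start()
}
```

The logic is the same as with braces; the only difference is that each branch is closed by a keyword rather than a lone } in its own PHP tag.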

Possible enhancements include:

  • Using egrep (or, equivalently, grep -E) instead of grep so the user can use extended regular expressions.
  • Adding the ability to search for images, videos, and other media on the server, rather than just web pages (this could be done by returning any such media that are used by pages that match the pattern).
  • Searching filenames in addition to file contents (you could use find for this).
  • Using a ranking algorithm (currently it just lists the files in whatever order grep returns them).
  • And of course adding CSS and other formatting to make the page more aesthetically pleasing.
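
As a rough sketch of the ranking idea, one option is to ask grep for per-file match counts and sort on them. With -c, a recursive grep prints one filename:count line per file. Counting matching lines is only a crude stand-in for relevance, and that choice is my assumption rather than anything the script does today. The function names rank_grep_counts and ranked_search are hypothetical. The -E flag also switches on extended regular expressions, which covers the egrep item above.

```php
<?php
// Hypothetical helper: turns "filename:count" lines, as printed by grep -c,
// into an array of filename => count, sorted with the highest count first.
function rank_grep_counts( array $lines ): array {
    $counts = [];
    foreach ( $lines as $line ) {
        // The count follows the last colon; filenames may contain colons too
        $pos = strrpos( $line, ':' );
        if ( $pos === false ) {
            continue;  // not a filename:count line, skip it
        }
        $count = (int) substr( $line, $pos + 1 );
        if ( $count > 0 ) {
            // Keep only files that actually matched
            $counts[ substr( $line, 0, $pos ) ] = $count;
        }
    }
    arsort( $counts );  // sort by count, descending, keeping the filename keys
    return $counts;
}

// Hypothetical wrapper: -c counts matching lines per file, -E enables
// extended regular expressions, -r and -i behave as in the original search.
function ranked_search( string $query, string $dir = '.' ): array {
    $lines = [];
    exec( 'grep -rciE -e ' . escapeshellarg( $query ) . ' ' . escapeshellarg( $dir ), $lines );
    return rank_grep_counts( $lines );
}
```

Looping over the array returned by ranked_search() would then list the files with the most matching lines first.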