Archive for January 29th, 2008

Adding a search function to vRteams

There are three ways to implement this: the first is Zend Lucene, the second is integrating with Nutch, and the last is PhpDig.

Nutch is a web crawler, indexer, and search engine based on Lucene. It is a Java tool, and we are familiar with Java, so that should not be a problem. The integration should be simple: rewrite search.jsp so that it returns XML or JSON rather than HTML, configure the crawler (DB and files), and then have the PHP code issue a GET request to that page and parse the results. The search server runs standalone and can even live on a different machine, so with this approach it can serve any PHP application. The only drawback is that we need to run a JVM.
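The PHP side of this approach could look roughly like the sketch below. It assumes the rewritten search.jsp returns JSON shaped like {"hits": [{"title": ..., "url": ...}]}. That response shape, the "query" parameter name, and both function names are hypothetical and not part of Nutch itself.

```php
<?php
// Sketch only: the JSON shape ("hits" with "title"/"url") and the "query"
// parameter are assumptions about how a rewritten search.jsp might respond.

// Turn the JSON body returned by the search page into a list of results.
function parseSearchResponse($sJson)
{
    $aDecoded = json_decode($sJson, true);
    if (!is_array($aDecoded) || !isset($aDecoded['hits']) || !is_array($aDecoded['hits'])) {
        return array(); // malformed or empty response
    }

    $aResults = array();
    foreach ($aDecoded['hits'] as $aHit) {
        if (!is_array($aHit) || !isset($aHit['url'])) {
            continue; // skip hits that carry no URL
        }
        $aResults[] = array(
            'title' => isset($aHit['title']) ? (string) $aHit['title'] : '',
            'url'   => (string) $aHit['url'],
        );
    }
    return $aResults;
}

// Issue the GET request against the (hypothetical) search page URL.
function searchNutch($sBaseUrl, $sQuery)
{
    $sUrl  = $sBaseUrl . '?' . http_build_query(array('query' => $sQuery));
    $sJson = @file_get_contents($sUrl);
    return ($sJson === false) ? array() : parseSearchResponse($sJson);
}
```

Keeping the parsing separate from the HTTP call means the PHP application only depends on the response format, so the search server can move to another machine without touching the parsing code.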

PhpDig is a web spider and search engine written in PHP that uses a MySQL database and supports flat files. It is not well supported, though, so I did not dig into it.

vRteams is written in PHP, so the natural solution is a PHP-based one (Zend Lucene), and at first it looks good. But as far as we know, Zend Lucene is just an indexing engine that ships with only an HTML parser. If we use it, then performance aside, too much is left to the application. Our application needs to index PDFs, PowerPoint presentations, Excel spreadsheets, and Word documents, and more formats will come in the future. We would also have to handle updating and maintaining the content and scheduling the crawl jobs ourselves. Of course, if we have enough time, we can do all of this step by step. At the end of this article is an implementation of the first part: parsing multiple binary file types.
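To give a feel for the update and maintenance work mentioned above, here is a minimal sketch of deciding which files need to be indexed again. It assumes we keep our own record of when each file was last indexed; both function names and the array layout (file path => Unix timestamp) are hypothetical.

```php
<?php
// Sketch only: both arrays map a file path to a Unix timestamp.
// $aCurrentMtimes holds the modification times found on disk now,
// $aIndexedMtimes holds the modification times recorded at the last indexing run.

// Files that are new or have changed since they were last indexed.
function filesNeedingReindex(array $aCurrentMtimes, array $aIndexedMtimes)
{
    $aStale = array();
    foreach ($aCurrentMtimes as $sPath => $iMtime) {
        if (!isset($aIndexedMtimes[$sPath]) || $aIndexedMtimes[$sPath] < $iMtime) {
            $aStale[] = $sPath;
        }
    }
    return $aStale;
}

// Files that are still in the index but no longer exist on disk.
function filesToRemoveFromIndex(array $aCurrentMtimes, array $aIndexedMtimes)
{
    return array_keys(array_diff_key($aIndexedMtimes, $aCurrentMtimes));
}
```

A scheduled job could call both functions on each run, re-parse the stale files, and delete the removed ones from the index.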

I searched extensively for other parsing tools but did not find a parser written in PHP, so the solution is based on the tools listed below:
* pdftotext (for parsing PDFs): part of Xpdf, a full-blown PDF reader that also provides numerous PDF and PS utilities.
* catdoc (for parsing Word documents): a set of parsers and utilities that also includes:
  * catppt (for parsing PowerPoint documents)
  * xls2csv (for parsing Excel documents)

//draft version

class BinaryFileParser
{
    // Placeholder logger: the draft calls customLog() without showing it.
    function customLog($sMessage)
    {
        error_log($sMessage);
    }

    // Returns the extracted text, "" if the file is missing,
    // or false if the file type is not supported.
    function parseFile($sFileName)
    {
        // Maximum number of characters that we will attempt to index per document.
        $iCharacterLimit = 250000;

        // Save the current buffer contents and start a clean buffer.
        $sBuffer = (ob_get_level() > 0) ? ob_get_clean() : "";
        ob_start();

        $sData = "";
        $sCommand = null;
        $sExtension = strtolower(pathinfo($sFileName, PATHINFO_EXTENSION));
        // Quote the file name so spaces and shell metacharacters are safe.
        $sFile = escapeshellarg($sFileName);

        if (file_exists($sFileName)) {
            $this->customLog("filename: " . $sFileName . "\n");

            switch ($sExtension) {
                case "pdf":
                    $sCommand = "pdftotext -nopgbrk -enc UTF-8 " . $sFile . " -";
                    break;
                case "doc":
                    $sCommand = "catdoc " . $sFile;
                    break;
                case "xls":
                    $sCommand = "xls2csv -c -q0 " . $sFile;
                    break;
                case "ppt":
                    $sCommand = "catppt " . $sFile;
                    break;
                default:
                    $this->customLog("Invalid File Type\n\n");
                    $sData = false;
            }
        } else {
            $this->customLog("$sFileName was missing...\n");
        }

        if ($sCommand !== null) {
            $aSpec = array(
                0 => array("pipe", "r"), // stdin is a pipe that the child will read from
                1 => array("pipe", "w"), // stdout is a pipe that the child will write to
                2 => array("pipe", "w")  // stderr is a pipe that the child will write to
            );

            $pHandle = proc_open($sCommand, $aSpec, $aPipes);

            if ($pHandle === false) {
                $this->customLog("Could not start: $sCommand\n");
            } else {
                // The child needs no input, so close its stdin right away.
                fclose($aPipes[0]);

                // Read stdout, then stderr. A tool that writes a very large
                // amount to stderr could still block here.
                $sData = stream_get_contents($aPipes[1]);
                $sError = stream_get_contents($aPipes[2]);

                fclose($aPipes[1]);
                fclose($aPipes[2]);

                $iExitCode = proc_close($pHandle);

                if ($sError || $iExitCode !== 0) {
                    $this->customLog("Exit code $iExitCode: $sError\n");
                }

                // Replace everything except ASCII letters, digits, and newlines
                // with a space. The result is plain ASCII, which is already
                // valid UTF-8, so the draft's utf8_encode() step is not needed.
                $sData = preg_replace('/[^A-Za-z\d\n]/', " ", $sData);

                // Trim the data down to an acceptable size.
                $sData = substr($sData, 0, $iCharacterLimit);
            }
        }

        // Discard anything printed while parsing.
        ob_end_clean();

        // Fill the buffer with the old values.
        ob_start();
        print($sBuffer);
        return $sData;
    } // end method parseFile()
}