  1. Solr 1.4 Enterprise Search Server Enhance your search with faceted navigation, result highlighting, fuzzy queries, ranked scoring, and more David Smiley Eric Pugh BIRMINGHAM - MUMBAI
  2. Solr 1.4 Enterprise Search Server Copyright © 2009 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. First published: August 2009 Production Reference: 1120809 Published by Packt Publishing Ltd. 32 Lincoln Road Olton Birmingham, B27 6PA, UK. ISBN 978-1-847195-88-3 www.packtpub.com Cover Image by Harmeet Singh (singharmeet@yahoo.com)
  3. Credits Authors: David Smiley, Eric Pugh. Reviewers: James Brady, Jerome Eteve. Acquisition Editor: Rashmi Phadnis. Development Editor: Darshana Shinde. Technical Editor: Pallavi Kachare. Copy Editor: Leonard D'Silva. Indexer: Monica Ajmera. Production Editorial Manager: Abhijeet Deobhakta. Editorial Team Leader: Akshara Aware. Project Team Leader: Priya Mukherji. Project Coordinator: Leena Purkait. Proofreader: Lynda Sliwoski. Production Coordinator: Shantanu Zagade. Cover Work: Shantanu Zagade.
  4. About the Authors Born to code, David Smiley is a senior software developer and loves programming. He has 10 years of experience in the defense industry at MITRE, using Java and various web technologies. David is a strong believer in the open source development model and has made small contributions to various projects over the years. David began using Lucene way back in 2000 during its infancy and was immediately excited by it and its future potential. He later went on to use the Lucene-based "Compass" library to construct a very basic search server, similar in spirit to Solr. Since then, David has used Solr in a major search project and was able to contribute modifications back to the Solr community. Although preferring open source solutions, David has also been trained on the commercial Endeca search platform and is currently using that product as well as Solr for different projects.
  5. Most, if not all, authors seem to dedicate their book to someone. As simply a reader of books, I have thought of this seeming prerequisite as customary tradition. That was my feeling before I embarked on writing about Solr, a project that has sapped my previously "free" time on nights and weekends for a year. I chose this sacrifice and would not change it, but my wife, family, and friends did not choose it. I am married to my lovely wife Sylvie who has sacrificed easily as much as I have to complete this book. She has suffered through this time with an absentee husband while bearing our first child, Camille. She was born about a week before the completion of my first draft and has been the apple of my eye ever since. I officially dedicate this book to my wife Sylvie and my daughter Camille, both of whom I lovingly adore. I also pledge to read book dedications with newfound firsthand experience of what the dedication represents. I would also like to thank others who helped bring this book to fruition. Namely, if it were not for Doug Cutting creating Lucene with an open source license, there would be no Solr. Furthermore, CNet's decision in 2006 to open source what was an in-house project, Solr itself, deserves praise. Many corporations do not understand that open source isn't just "free code" you get for free that others wrote; it is an opportunity to let your code flourish on the outside instead of it withering inside. Finally, I thank the team at Packt who were particularly patient with me as a first-time author writing at a pace that left a lot to be desired. Last but not least, this book would not have been completed in a reasonable time were it not for the assistance of my contributing author, Eric Pugh. His perspectives and experiences have complemented mine so well that I am absolutely certain the quality of this book is much better than what I could have done alone. Thank you all.
  6. Fascinated by the 'craft' of software development, Eric Pugh has been heavily involved in the open source world as a developer, committer, and user for the past five years. He is an emeritus member of the Apache Software Foundation and lately has been mulling over how we move from the read/write Web to the read/write/share Web. In biotech, financial services, and defense IT, he has helped European and American companies develop coherent strategies for embracing open source software. As a speaker, he has advocated the advantages of Agile practices in software development. Eric became involved with Solr when he submitted the patch SOLR-284 for Parsing Rich Document types such as PDF and MS Office formats that became the single most popular patch as measured by votes! The patch was subsequently cleaned up and enhanced by three other individuals, demonstrating the power of the open source model to build great code collaboratively. SOLR-284 was eventually refactored into Solr Cell as part of Solr version 1.4. He blogs at http://www.opensourceconnections.com/blog/.
  7. Throughout my life I have been helped by so many people, but all too rarely do I get to explicitly thank them. This book is arguably one of the high points of my career, and as I wrote it, I thought about all the people who have provided encouragement, mentoring, and the occasional push to succeed. First off, I would like to thank Erik Hatcher, author, entrepreneur, and great family man for introducing me to the world of open source software. My first hesitant patch to Ant was made under his tutelage, and later my interest in Solr was fanned by his advocacy. Thanks to Harry Sleeper for taking a chance on a first time conference speaker; he moved me from thinking of myself as a developer improving myself to thinking of myself as a consultant improving the world (of software!). His team at MITRE are some of the most passionate developers I have met, and it was through them I met my co-author David. I owe a huge debt of gratitude to David Smiley. He has encouraged me, coached me, and put up with my lack of respect for book deadlines, making this book project a very positive experience! I look forward to the next one. With my new son Morgan at home, I could only have done this project with a generous support of time from my company, OpenSource Connections. I am incredibly proud of what o19s is accomplishing! Lastly, to all the folks in the Solr/Lucene community who took the time to review early drafts and provide feedback: Solr is at the tipping point of becoming the "it" search engine because of your passion and commitment. I am who I am because of my wife, Kate. Schweetie, real life for me began when we met. Thank you.
  8. About the Reviewers James Brady is an entrepreneur and software developer living in San Francisco, CA. Originally from England, James discovered his passion for computer science and programming while at Cambridge University. Upon graduation, James worked as a software engineer at IBM's Hursley Park laboratory, a role which taught him many things, most importantly his desire to work in a small company. In January 2008, James founded WebMynd Corp., which received angel funding from the Y Combinator fund, and he relocated to San Francisco. WebMynd is one of the largest installations of Solr, indexing up to two million HTML documents per day, and making heavy use of Solr's multicore features to enable a partially active index. Jerome Eteve holds a BSc in physics, maths and computing and an MSc in IT and bioinformatics from the University of Lille (France). After starting his career in the field of bioinformatics, where he worked as a biological data management and analysis consultant, he's now a senior web developer with interests ranging from database level issues to user experience online. He's passionate about open source technologies, search engines, and web application architecture. He has worked since 2006 for Careerjet Ltd, a worldwide job search engine.
  9. Table of Contents Preface. Chapter 1: Quick Starting Solr. An introduction to Solr; Lucene, the underlying engine; Solr, the Server-ization of Lucene; Comparison to database technology; Getting started; The last official release or fresh code from source control; Testing and building Solr; Solr's installation directory structure; Solr's home directory; How Solr finds its home; Deploying and running Solr; A quick tour of Solr!; Loading sample data; A simple query; Some statistics; The schema and configuration files; Solr resources outside this book; Summary. Chapter 2: Schema and Text Analysis. MusicBrainz.org; One combined index or multiple indices; Problems with using a single combined index; Schema design; Step 1: Determine which searches are going to be powered by Solr; Step 2: Determine the entities returned from each search;
  10. Step 3: Denormalize related data; Denormalizing—"one-to-one" associated data; Denormalizing—"one-to-many" associated data; Step 4: (Optional) Omit the inclusion of fields only used in search results; The schema.xml file; Field types; Field options; Field definitions; Sorting; Dynamic fields; Using copyField; Remaining schema.xml settings; Text analysis; Configuration; Experimenting with text analysis; Tokenization; WordDelimiterFilterFactory; Stemming; Synonyms; Index-time versus Query-time, and to expand or not; Stop words; Phonetic sounds-like analysis; Partial/Substring indexing; N-gramming costs; Miscellaneous analyzers; Summary. Chapter 3: Indexing Data. Communicating with Solr; Direct HTTP or a convenient client API; Data streamed remotely or from Solr's filesystem; Data formats; Using curl to interact with Solr; Remote streaming; Sending XML to Solr; Deleting documents; Commit, optimize, and rollback; Sending CSV to Solr; Configuration options; Direct database and XML import; Getting started with DIH; The DIH development console;
  11. DIH documents, entities; DIH fields and transformers; Importing with DIH; Indexing documents with Solr Cell; Extracting binary content; Configuring Solr; Extracting karaoke lyrics; Indexing richer documents; Summary. Chapter 4: Basic Searching. Your first search, a walk-through; Solr's generic XML structured data representation; Solr's XML response format; Parsing the URL; Query parameters; Parameters affecting the query; Result paging; Output related parameters; Diagnostic query parameters; Query syntax; Matching all the documents; Mandatory, prohibited, and optional clauses; Boolean operators; Sub-expressions (aka sub-queries); Limitations of prohibited clauses in sub-expressions; Field qualifier; Phrase queries and term proximity; Wildcard queries; Fuzzy queries; Range queries; Date math; Score boosting; Existence (and non-existence) queries; Escaping special characters; Filtering; Sorting; Request handlers; Scoring; Query-time and index-time boosting; Troubleshooting scoring; Summary.
  12. Chapter 5: Enhanced Searching. Function queries; An example: Scores influenced by a lookupcount; Field references; Function reference; Mathematical primitives; Miscellaneous math; ord and rord; An example with scale() and lookupcount; Using logarithms; Using inverse reciprocals; Using reciprocals and rord with dates; Function query tips; Dismax Solr request handler; Lucene's DisjunctionMaxQuery; Configuring queried fields and boosts; Limited query syntax; Boosting: Automatic phrase boosting; Configuring automatic phrase boosting; Phrase slop configuration; Boosting: Boost queries; Boosting: Boost functions; Min-should-match; Basic rules; Multiple rules; What to choose; A default search; Faceting; A quick example: Faceting release types; MusicBrainz schema changes; Field requirements; Types of faceting; Faceting text; Alphabetic range bucketing (A-C, D-F, and so on); Faceting dates; Date facet parameters; Faceting on arbitrary queries; Excluding filters; The solution: Local Params; Facet prefixing (term suggest); Summary.
  13. Chapter 6: Search Components. About components; The highlighting component; A highlighting example; Highlighting configuration; Query elevation; Configuration; Spell checking; Schema configuration; Configuration in solrconfig.xml; Configuring spellcheckers (dictionaries); Processing of the q parameter; Processing of the spellcheck.q parameter; Building the dictionary from its source; Issuing spellcheck requests; Example usage for a misspelled query; An alternative approach; The more-like-this search component; Configuration parameters; Parameters specific to the MLT search component; Parameters specific to the MLT request handler; Common MLT parameters; MLT results example; Stats component; Configuring the stats component; Statistics on track durations; Field collapsing; Configuring field collapsing; Other components; Terms component; termVector component; LocalSolr component; Summary. Chapter 7: Deployment. Implementation methodology; Questions to ask; Installing into a Servlet container; Differences between Servlet containers; Defining solr.home property;
  14. Logging; HTTP server request access logs; Solr application logging; Configuring logging output; Logging to Log4j; Jetty startup integration; Managing log levels at runtime; A SearchHandler per search interface; Solr cores; Configuring solr.xml; Managing cores; Why use multicore; JMX; Starting Solr with JMX; Take a walk on the wild side! Use JRuby to extract JMX information; Securing Solr; Limiting server access; Controlling JMX access; Securing index data; Controlling document access; Other things to look at; Summary. Chapter 8: Integrating Solr. Structure of included examples; Inventory of examples; SolrJ: Simple Java interface; Using Heritrix to download artist pages; Indexing HTML in Solr; SolrJ client API; Indexing POJOs; When should I use Embedded Solr; In-Process streaming; Rich clients; Upgrading from legacy Lucene; Using JavaScript to integrate Solr; Wait, what about security?; Building a Solr powered artists autocomplete widget with jQuery and JSONP; SolrJS: JavaScript interface to Solr; Accessing Solr from PHP applications; solr-php-client; Drupal options; Apache Solr Search integration module;
  15. Hosted Solr by Acquia; Ruby on Rails integrations; acts_as_solr; Setting up MyFaves project; Populating MyFaves relational database from Solr; Build Solr indexes from relational database; Complete MyFaves web site; Blacklight OPAC; Indexing MusicBrainz data; Customizing display; solr-ruby versus rsolr; Summary. Chapter 9: Scaling Solr. Tuning complex systems; Using Amazon EC2 to practice tuning; Firing up Solr on Amazon EC2; Optimizing a single Solr server (Scale High); JVM configuration; HTTP caching; Solr caching; Tuning caches; Schema design considerations; Indexing strategies; Disable unique document checking; Commit/optimize factors; Enhancing faceting performance; Using term vectors; Improving phrase search performance; The solution: Shingling; Moving to multiple Solr servers (Scale Wide); Script versus Java replication; Starting multiple Solr servers; Configuring replication; Distributing searches across slaves; Indexing into the master server; Configuring slaves; Distributing search queries across slaves; Sharding indexes; Assigning documents to shards; Searching across shards; Combining replication and sharding (Scale Deep); Summary. Index.
  16. Preface Text search has been around for perhaps longer than we all can remember. Just about all systems, from client installed software to web sites to the web itself, have search. Yet there is a big difference between the best search experiences and the mediocre, unmemorable ones. If you want the application you're building to stand out above the rest, then it's got to have great search features. If you leave this to the capabilities of a database, then it's nearly impossible to get a great search experience, because a database lacks the features that users have come to expect from great search. With Solr, the leading open source search server, you'll tap into a host of features from highlighting search results to spell-checking to faceting. As you read Solr Enterprise Search Server you'll be guided through all of the aspects of Solr, from the initial download to eventual deployment and performance optimization. Nearly all the options of Solr are listed and described here, thus making this book a resource to turn to as you implement your Solr-based solution. The book contains code examples in several programming languages that explore various integration options, such as implementing query auto-complete in a web browser and integrating a web crawler. You'll find these working examples in the online supplement to the book along with a large, real-world, openly available data set from MusicBrainz.org. Furthermore, you will also find instructions on accessing a Solr image readily deployed from within Amazon's Elastic Compute Cloud. Solr Enterprise Search Server targets the Solr 1.4 version. However, as this book went to print prior to Solr 1.4's release, two features were not incorporated into the book: search result clustering and trie-range numeric fields.
  17. What this book covers Chapter 1, Quick Starting Solr introduces Solr to the reader as a middle ground between database technology and document/web crawlers. The reader is guided through the Solr distribution including running the sample configuration with sample data. Chapter 2, The Schema and Text Analysis is all about Solr's schema. The schema design is an important first order of business along with the related text analysis configuration. Chapter 3, Indexing Data details several methods to import data; most of them can be used to bring the MusicBrainz data set into the index. A popular Solr extension called the DataImportHandler is demonstrated too. Chapter 4, Basic Searching is a thorough reference to Solr's query syntax from the basics to range queries. Factors influencing Solr's scoring algorithm are explained here, as well as diagnostic output essential to understanding how the query worked and how a score is computed. Chapter 5, Enhanced Searching moves on to more querying topics. Various score boosting methods are explained from those based on record-level data to those that match particular fields or those that contain certain words. Next, faceting is a major subject area of this chapter. Finally, term auto-complete, which is implemented by the faceting mechanism, is demonstrated. Chapter 6, Search Components covers a variety of searching extras in the form of Solr "components", namely, spell-check suggestions, highlighting search results, computing statistics of numeric fields, editorial alterations to specific user queries, and finding other records "more like this". Chapter 7, Deployment transitions from running Solr from a developer-centric perspective to deploying and running Solr as a production enterprise service that is secure, has robust logging, and can be managed by System Administrators.
Chapter 8, Integrating Solr surveys a plethora of integration options for Solr, from supported client libraries in Java, JavaScript, and Ruby, to being able to consume Solr results in XML, JSON, and even PHP syntaxes. We'll look at some best practices and approaches for integrating Solr into your web application. Chapter 9, Scaling Solr looks at how to scale Solr up and out to avoid meltdown and meet performance expectations. This information varies from small changes of configuration files to architectural options.
  18. Who this book is for This book is for developers who would like to use Solr to implement a search capability for their applications. You need only to have basic programming skills to use Solr; extending or modifying Solr itself requires Java programming. Knowledge of Lucene, the foundation of Solr, is certainly a bonus. Conventions In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text are shown as follows: "These are essentially defaults for searches that are processed by Solr request handlers defined in solrconfig.xml." A block of code is set as follows: id When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold: mccm.pdf Any command-line input or output is written as follows: >> curl http://localhost:8983/solr/karaoke/update/ -H "Content-Type: text/xml" --data-binary '' New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Take for example the Top Voters section". Warnings or important notes appear in a box like this. Tips and tricks appear like this.
  19. Reader feedback Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of. To send us general feedback, simply send an email to feedback@packtpub.com, and mention the book title in the subject of your message. If there is a book that you need and would like to see us publish, please send us a note in the SUGGEST A TITLE form on www.packtpub.com or email suggest@packtpub.com. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book on, see our author guide on www.packtpub.com/authors. Customer support Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase. Downloading the example code for the book Visit http://www.packtpub.com/files/code/5883_Code.zip to directly download the example code. The downloadable files contain instructions on how to use them. Errata Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you would report this to us. By doing so, you can save other readers from frustration, and help us to improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/support, selecting your book, clicking on the let us know link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata added to any list of existing errata. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.