Apache Solr 3 Enterprise Search Server
- Comprehensive information on Apache Solr 3 with examples and tips so you can focus on the important parts
- Integration examples with databases, web-crawlers, XSLT, Java & embedded-Solr, PHP & Drupal, JavaScript, Ruby frameworks
- Advice on data modeling; deployment considerations including security, logging, and monitoring; and guidance on scaling Solr and measuring performance
- An update of the best-selling title on Solr 1.4
Book Details
Language : English
Paperback : 418 pages [ 235mm x 191mm ]
Release Date : November 2011
ISBN : 1849516065
ISBN 13 : 9781849516068
Author(s) : David Smiley, Eric Pugh
Topics and Technologies : All Books, Big Data and Business Intelligence, Open Source
Table of Contents
Preface
Chapter 1: Quick Starting Solr
Chapter 2: Schema and Text Analysis
Chapter 3: Indexing Data
Chapter 4: Searching
Chapter 5: Search Relevancy
Chapter 6: Faceting
Chapter 7: Search Components
Chapter 8: Deployment
Chapter 9: Integrating Solr
Chapter 10: Scaling Solr
Appendix: Search Quick Reference
Index
- Chapter 1: Quick Starting Solr
- An introduction to Solr
- Lucene, the underlying engine
- Solr, a Lucene-based search server
- Comparison to database technology
- Getting started
- Solr's installation directory structure
- Solr's home directory and Solr cores
- Running Solr
- A quick tour of Solr
- Loading sample data
- A simple query
- Some statistics
- The sample browse interface
- Configuration files
- Resources outside this book
- Summary
- Chapter 2: Schema and Text Analysis
- MusicBrainz.org
- One combined index or separate indices
- One combined index
- Problems with using a single combined index
- Separate indices
- Schema design
- Step 1: Determine which searches are going to be powered by Solr
- Step 2: Determine the entities returned from each search
- Step 3: Denormalize related data
- Denormalizing—'one-to-one' associated data
- Denormalizing—'one-to-many' associated data
- Step 4: (Optional) Omit the inclusion of fields only used in search results
- The schema.xml file
- Defining field types
- Built-in field type classes
- Numbers and dates
- Geospatial
- Field options
- Field definitions
- Dynamic field definitions
- Our MusicBrainz field definitions
- Copying fields
- The unique key
- The default search field and query operator
- Text analysis
- Configuration
- Experimenting with text analysis
- Character filters
- Tokenization
- WordDelimiterFilter
- Stemming
- Correcting and augmenting stemming
- Synonyms
- Index-time versus query-time, and to expand or not
- Stop words
- Phonetic sounds-like analysis
- Substring indexing and wildcards
- ReversedWildcardFilter
- N-grams
- N-gram costs
- Sorting Text
- Miscellaneous token filters
- Summary
- Chapter 3: Indexing Data
- Communicating with Solr
- Direct HTTP or a convenient client API
- Push data to Solr or have Solr pull it
- Data formats
- HTTP POSTing options to Solr
- Remote streaming
- Solr's Update-XML format
- Deleting documents
- Commit, optimize, and rollback
- Sending CSV formatted data to Solr
- Configuration options
- The Data Import Handler Framework
- Setup
- The development console
- Writing a DIH configuration file
- Data Sources
- Entity processors
- Fields and transformers
- Example DIH configurations
- Importing from databases
- Importing XML from a file with XSLT
- Importing multiple rich document files (crawling)
- Importing commands
- Delta imports
- Indexing documents with Solr Cell
- Extracting text and metadata from files
- Configuring Solr
- Solr Cell parameters
- Extracting karaoke lyrics
- Indexing richer documents
- Update request processors
- Summary
- Chapter 4: Searching
- Your first search, a walk-through
- Solr's generic XML structured data representation
- Solr's XML response format
- Parsing the URL
- Request handlers
- Query parameters
- Search criteria related parameters
- Result pagination related parameters
- Output related parameters
- Diagnostic related parameters
- Query parsers and local-params
- Query syntax (the lucene query parser)
- Matching all the documents
- Mandatory, prohibited, and optional clauses
- Boolean operators
- Sub-queries
- Limitations of prohibited clauses in sub-queries
- Field qualifier
- Phrase queries and term proximity
- Wildcard queries
- Fuzzy queries
- Range queries
- Date math
- Score boosting
- Existence (and non-existence) queries
- Escaping special characters
- The Dismax query parser (part 1)
- Searching multiple fields
- Limited query syntax
- Min-should-match
- Basic rules
- Multiple rules
- What to choose
- A default search
- Filtering
- Sorting
- Geospatial search
- Indexing locations
- Filtering by distance
- Sorting by distance
- Summary
- Chapter 5: Search Relevancy
- Scoring
- Query-time and index-time boosting
- Troubleshooting queries and scoring
- Dismax query parser (part 2)
- Lucene's DisjunctionMaxQuery
- Boosting: Automatic phrase boosting
- Configuring automatic phrase boosting
- Phrase slop configuration
- Partial phrase boosting
- Boosting: Boost queries
- Boosting: Boost functions
- Add or multiply boosts?
- Function queries
- Field references
- Function reference
- Mathematical primitives
- Other math
- ord and rord
- Miscellaneous functions
- Function query boosting
- Formula: Logarithm
- Formula: Inverse reciprocal
- Formula: Reciprocal
- Formula: Linear
- How to boost based on an increasing numeric field
- Step by step…
- External field values
- How to boost based on recent dates
- Step by step…
- Summary
- Chapter 6: Faceting
- A quick example: Faceting release types
- MusicBrainz schema changes
- Field requirements
- Types of faceting
- Faceting field values
- Alphabetic range bucketing
- Faceting numeric and date ranges
- Range facet parameters
- Facet queries
- Building a filter query from a facet
- Field value filter queries
- Facet range filter queries
- Excluding filters (multi-select faceting)
- Hierarchical faceting
- Summary
- Chapter 7: Search Components
- About components
- The Highlight component
- A highlighting example
- Highlighting configuration
- The regex fragmenter
- The fast vector highlighter with multi-colored highlighting
- The SpellCheck component
- Schema configuration
- Configuration in solrconfig.xml
- Configuring spellcheckers (dictionaries)
- Processing of the q parameter
- Processing of the spellcheck.q parameter
- Building the dictionary from its source
- Issuing spellcheck requests
- Example usage for a misspelled query
- Query complete / suggest
- Query term completion via facet.prefix
- Query term completion via the Suggester
- Query term completion via the Terms component
- The QueryElevation component
- Configuration
- The MoreLikeThis component
- Configuration parameters
- Parameters specific to the MLT search component
- Parameters specific to the MLT request handler
- Common MLT parameters
- MLT results example
- The Stats component
- Configuring the stats component
- Statistics on track durations
- The Clustering component
- Result grouping/Field collapsing
- Configuring result grouping
- The TermVector component
- Summary
- Chapter 8: Deployment
- Deployment methodology for Solr
- Questions to ask
- Installing Solr into a Servlet container
- Differences between Servlet containers
- Defining solr.home property
- Logging
- HTTP server request access logs
- Solr application logging
- Configuring logging output
- Logging using Log4j
- Jetty startup integration
- Managing log levels at runtime
- A SearchHandler per search interface?
- Leveraging Solr cores
- Configuring solr.xml
- Property substitution
- Include fragments of XML with XInclude
- Managing cores
- Why use multicore?
- Monitoring Solr performance
- Stats.jsp
- JMX
- Starting Solr with JMX
- Securing Solr from prying eyes
- Limiting server access
- Securing public searches
- Controlling JMX access
- Securing index data
- Controlling document access
- Other things to look at
- Summary
- Chapter 9: Integrating Solr
- Working with included examples
- Inventory of examples
- Solritas, the integrated search UI
- Pros and Cons of Solritas
- SolrJ: Simple Java interface
- Using Heritrix to download artist pages
- SolrJ-based client for Indexing HTML
- SolrJ client API
- Embedding Solr
- Searching with SolrJ
- Indexing
- When should I use embedded Solr?
- In-process indexing
- Standalone desktop applications
- Upgrading from legacy Lucene
- Using JavaScript with Solr
- Wait, what about security?
- Building a Solr powered artists autocomplete widget with jQuery and JSONP
- AJAX Solr
- Using XSLT to expose Solr via OpenSearch
- OpenSearch based Browse plugin
- Installing the Search MBArtists plugin
- Accessing Solr from PHP applications
- solr-php-client
- Drupal options
- Apache Solr Search integration module
- Hosted Solr by Acquia
- Ruby on Rails integrations
- The Ruby query response writer
- sunspot_rails gem
- Setting up MyFaves project
- Populating MyFaves relational database from Solr
- Build Solr indexes from a relational database
- Complete MyFaves website
- Which Rails/Ruby library should I use?
- Nutch for crawling web pages
- Maintaining document security with ManifoldCF
- Connectors
- Putting ManifoldCF to use
- Summary
- Chapter 10: Scaling Solr
- Tuning complex systems
- Testing Solr performance with SolrMeter
- Optimizing a single Solr server (Scale up)
- Configuring JVM settings to improve memory usage
- MMapDirectoryFactory to leverage additional virtual memory
- Enabling downstream HTTP caching
- Solr caching
- Tuning caches
- Indexing performance
- Designing the schema
- Sending data to Solr in bulk
- Don't overlap commits
- Disabling unique key checking
- Index optimization factors
- Enhancing faceting performance
- Using term vectors
- Improving phrase search performance
- Moving to multiple Solr servers (Scale horizontally)
- Replication
- Starting multiple Solr servers
- Configuring replication
- Load balancing searches across slaves
- Indexing into the master server
- Configuring slaves
- Configuring load balancing
- Sharding indexes
- Assigning documents to shards
- Searching across shards (distributed search)
- Combining replication and sharding (Scale deep)
- Near real time search
- Where next for scaling Solr?
- Summary
- Appendix: Search Quick Reference
- Quick reference
David Smiley
Eric Pugh
Errata
- 7 submitted: last submission 11 Apr 2013
Errata type: Typo | Errata date: 17th May 12
Page no.: 26
Check the second line in the second paragraph. "At the moment, we're just going to take a peak at the request handlers, which are defined with <requestHandler> elements."
Here, the word "peak" is wrong. Instead, it should be "peek".
Errata type: Typo | Errata date: 24th December 12
Check for instances of the word "solconfig.xml":
Page 114: Check the sentence "This XML is used for most of the response XML and it is also used in parts of solconfig.xml too."
Page 204: Check the sentence "Choose the snippet fragmenting algorithm. This parameter refers to a named <fragmenter/> element in <highlighting/> in solconfig.xml. gap is the default typical choice based on a fragment size."
Page 204: Check the sentence "This parameter refers to a named <formatter/> element in <highlighting/> in solconfig.xml."
Page 204: Check the sentence "This is a reference to a named <encoder/> element in <highlighting/> in solconfig.xml."
Page 206: Check the sentence "This parameter refers to a named <fragListBuilder/> element in <highlighting/> in solconfig.xml."
Page 206: Check the sentence "This parameter refers to a named <fragmentsBuilder/> element in <highlighting/> in solconfig.xml."
All these instances are wrong. Each should be replaced by "solrconfig.xml".
Errata type: Typo | Errata date: 24th December 12
Page 164: Check the sentence "With luck, you may find some websites that will suffice, perhaps http://www.wolframapha.com."
This link is wrong. It should be "http://www.wolframalpha.com". (alpha instead of apha)
Errata type: Typo | Errata date: 24th December 12
Check for the instances of the word "XHMTL":
Page 103: Check the sentence "To return only the metadata, and discard all the body content of the XHMTL you would use xpath=/xhtml:html/ xhtml:head/descendant:node()."
Page 104: Check the sentence "Defaults to xml to produce the XHMTL structure."
Page 108: Check the sentence "This returns an XHMTL document that contains the metadata extracted from the document in the <head/> stanza, as well as the basic structure of the contents expressed as XHTML."
All these instances are wrong. Each should be replaced by "XHTML".
Errata type: Typo | Errata date: 21st February 12
Page no. 347: Check the line of code <filter class="solr.CommonGramsFilterFactory" words="commongrams.txt" ignoreCase="true"/>"/>
There's an extra "/> at the end of this line. It should be removed.
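The effect of the stray characters can be checked with a short script. This is only an illustrative sketch, not something from the book: it feeds the line as printed and the corrected line to Python's standard XML parser, and the helper name `is_well_formed` is made up for this example.

```python
import xml.etree.ElementTree as ET

# The line as printed on page 347, including the stray trailing "/> characters.
PRINTED = ('<filter class="solr.CommonGramsFilterFactory" '
           'words="commongrams.txt" ignoreCase="true"/>"/>')

# The same line with the extra characters removed, as the erratum describes.
CORRECTED = ('<filter class="solr.CommonGramsFilterFactory" '
             'words="commongrams.txt" ignoreCase="true"/>')


def is_well_formed(fragment):
    """Return True if the fragment parses as a single XML element.

    Hypothetical helper for this example only.
    """
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


print("printed line well-formed:  ", is_well_formed(PRINTED))
print("corrected line well-formed:", is_well_formed(CORRECTED))

# The corrected element keeps all three attributes intact.
print(ET.fromstring(CORRECTED).attrib)
```

A generic XML parser rejects the printed line because of the text that follows the self-closing element, so a schema containing it would be expected to fail to load.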
Errata type: Typo | Errata date: 2nd March 12
Page no. 250:
The last paragraph talks about the HTTP server request log format and about the last number in a logline being the time in milliseconds for serving the request. This is not correct, as this represents the number of bytes sent in the response.
While it might be true that Jetty compiles JSP pages on first request, the number "3816" refers to the size of the admin page's HTML sent over the wire. (Otherwise, subsequent requests would show a smaller number.)
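To make the corrected reading concrete, here is a minimal sketch that parses one access log line, assuming the server writes the NCSA Common Log Format (host, identity, user, timestamp, request, status code, response size in bytes). The sample line is hypothetical; only the value 3816 comes from the erratum. The pattern and the function name are assumptions made for this example, and extended log formats that append further fields would need a different pattern.

```python
import re

# NCSA Common Log Format: host ident user [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_access_log_line(line):
    """Split one Common Log Format line into named fields.

    Hypothetical helper for this example only.
    """
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("not a Common Log Format line: %r" % line)
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" in the last position means no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Hypothetical log line; the final number is the response size in bytes,
# not the time taken to serve the request.
SAMPLE_LINE = ('127.0.0.1 - - [02/Mar/2012:10:15:30 +0000] '
               '"GET /solr/admin/ HTTP/1.1" 200 3816')

entry = parse_access_log_line(SAMPLE_LINE)
print(entry["request"], "->", entry["status"], entry["size"], "bytes")
```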
Errata type: Typo | Errata date: 25th April 12
Page no. 2: Check the "What you need for this book" section. The first bullet point reads "Java 6, a JDK release. Do not use Java 7."
There was a problem with the initial release of Java 7, but the 7u1 release addressed it, so Java 7 is now safe to use. The advice on page 2 not to use Java 7 is therefore outdated.
- Design a schema to include text indexing details like tokenization, stemming, and synonyms
- Import data using various formats like CSV, XML, and from databases, and extract text from common document formats
- Search using Solr’s rich query syntax, perform geospatial searches, and influence relevancy order
- Enhance search results with faceting, query spell-checking, auto-completing queries, highlighted search results, and more
- Integrate a host of technologies with Solr from the server side to client-side JavaScript, to frameworks like Drupal
- Scale Solr by learning how to tune it and how to use replication and sharding
If you are a developer building an app today then you know how important a good search experience is. Apache Solr, built on Apache Lucene, is a wildly popular open source enterprise search server that easily delivers powerful search and faceted navigation features that are elusive with databases. Solr supports complex search criteria, faceting, result highlighting, query-completion, query spell-check, relevancy tuning, and more.
Apache Solr 3 Enterprise Search Server is a comprehensive reference guide for every feature Solr has to offer. It takes the reader from first steps through development to deployment. It also comes with complete running examples that demonstrate Solr's use and show how to integrate it with other languages and frameworks.
Using a large set of metadata about artists, releases, and tracks, courtesy of the MusicBrainz.org project, you will have a testing ground for Solr and will learn how to import this data in various ways. You will then learn how to search this data in different ways, including using Solr's rich query syntax and "boosting" match scores based on record data.
Finally, we'll cover various deployment considerations, including indexing strategies and performance-oriented configuration, that will enable you to scale Solr to meet the needs of a high-volume site.
The book is written as a reference guide. It includes fully working examples based on a real-world public data set.
This book is for developers who want to learn how to use Apache Solr in their applications. Only basic programming skills are needed.