Apache Solr 3 Enterprise Search Server

David Smiley, Eric Pugh

eBook: $29.99, now $25.49 (save 15%)
Formats: PDF, PacktLib, ePub, and Mobi

Print + free eBook + free PacktLib access to the book: $79.98, now $49.99 (save 37%)
Print cover: $49.99

Free shipping to the UK, US, Europe, and selected countries in Asia.

Comprehensive information on Apache Solr 3 with examples and tips so you can focus on the important parts
Integration examples with databases, web-crawlers, XSLT, Java & embedded-Solr, PHP & Drupal, JavaScript, Ruby frameworks
Advice on data modeling; deployment considerations including security, logging, and monitoring; and scaling Solr and measuring performance
An update of the best-selling title on Solr 1.4

Book Details

Language : English
Paperback : 418 pages [ 235mm x 191mm ]
Release Date : November 2011
ISBN : 1849516065
ISBN 13 : 9781849516068
Author(s) : David Smiley, Eric Pugh
Topics and Technologies : All Books, Big Data and Business Intelligence, Open Source

Preface
Chapter 1: Quick Starting Solr
Chapter 2: Schema and Text Analysis
Chapter 3: Indexing Data
Chapter 4: Searching
Chapter 5: Search Relevancy
Chapter 6: Faceting
Chapter 7: Search Components
Chapter 8: Deployment
Chapter 9: Integrating Solr
Chapter 10: Scaling Solr
Appendix: Search Quick Reference
Index

Preface

Chapter 1: Quick Starting Solr

An introduction to Solr

Lucene, the underlying engine
Solr, a Lucene-based search server
Comparison to database technology

Getting started

Solr's installation directory structure
Solr's home directory and Solr cores
Running Solr

A quick tour of Solr

Loading sample data
A simple query
Some statistics
The sample browse interface

Configuration files
Resources outside this book
Summary

Chapter 2: Schema and Text Analysis

MusicBrainz.org
One combined index or separate indices

One combined index

Problems with using a single combined index

Separate indices

Schema design

Step 1: Determine which searches are going to be powered by Solr
Step 2: Determine the entities returned from each search
Step 3: Denormalize related data

Denormalizing—'one-to-one' associated data
Denormalizing—'one-to-many' associated data

Step 4: (Optional) Omit the inclusion of fields only used in search results

The schema.xml file

Defining field types
Built-in field type classes

Numbers and dates
Geospatial

Field options
Field definitions

Dynamic field definitions

Our MusicBrainz field definitions
Copying fields
The unique key
The default search field and query operator

Text analysis

Configuration
Experimenting with text analysis
Character filters
Tokenization
WordDelimiterFilter
Stemming

Correcting and augmenting stemming

Synonyms

Index-time versus query-time, and to expand or not

Stop words
Phonetic sounds-like analysis
Substring indexing and wildcards

ReversedWildcardFilter
N-grams
N-gram costs

Sorting Text
Miscellaneous token filters

Summary

Chapter 3: Indexing Data

Communicating with Solr

Direct HTTP or a convenient client API
Push data to Solr or have Solr pull it
Data formats
HTTP POSTing options to Solr
Remote streaming

Solr's Update-XML format

Deleting documents

Commit, optimize, and rollback
Sending CSV formatted data to Solr

Configuration options

The Data Import Handler Framework

Setup
The development console
Writing a DIH configuration file

Data Sources
Entity processors
Fields and transformers

Example DIH configurations

Importing from databases
Importing XML from a file with XSLT
Importing multiple rich document files (crawling)

Importing commands

Delta imports

Indexing documents with Solr Cell

Extracting text and metadata from files
Configuring Solr
Solr Cell parameters
Extracting karaoke lyrics
Indexing richer documents

Update request processors
Summary

Chapter 4: Searching

Your first search, a walk-through
Solr's generic XML structured data representation
Solr's XML response format

Parsing the URL

Request handlers
Query parameters

Search criteria related parameters
Result pagination related parameters
Output related parameters
Diagnostic related parameters

Query parsers and local-params
Query syntax (the lucene query parser)

Matching all the documents
Mandatory, prohibited, and optional clauses

Boolean operators

Sub-queries

Limitations of prohibited clauses in sub-queries

Field qualifier
Phrase queries and term proximity
Wildcard queries

Fuzzy queries

Range queries

Date math

Score boosting
Existence (and non-existence) queries
Escaping special characters

The Dismax query parser (part 1)

Searching multiple fields
Limited query syntax
Min-should-match

Basic rules
Multiple rules
What to choose

A default search

Filtering
Sorting
Geospatial search

Indexing locations
Filtering by distance
Sorting by distance

Summary

Chapter 5: Search Relevancy

Scoring

Query-time and index-time boosting
Troubleshooting queries and scoring

Dismax query parser (part 2)

Lucene's DisjunctionMaxQuery
Boosting: Automatic phrase boosting

Configuring automatic phrase boosting
Phrase slop configuration
Partial phrase boosting

Boosting: Boost queries
Boosting: Boost functions

Add or multiply boosts?

Function queries

Field references
Function reference

Mathematical primitives
Other math
ord and rord
Miscellaneous functions

Function query boosting

Formula: Logarithm
Formula: Inverse reciprocal
Formula: Reciprocal
Formula: Linear

How to boost based on an increasing numeric field

Step by step…
External field values

How to boost based on recent dates

Step by step…

Summary

Chapter 6: Faceting

A quick example: Faceting release types

MusicBrainz schema changes

Field requirements
Types of faceting
Faceting field values

Alphabetic range bucketing

Faceting numeric and date ranges

Range facet parameters

Facet queries
Building a filter query from a facet

Field value filter queries
Facet range filter queries

Excluding filters (multi-select faceting)
Hierarchical faceting
Summary

Chapter 7: Search Components

About components
The Highlight component

A highlighting example
Highlighting configuration

The regex fragmenter
The fast vector highlighter with multi-colored highlighting

The SpellCheck component

Schema configuration
Configuration in solrconfig.xml

Configuring spellcheckers (dictionaries)
Processing of the q parameter
Processing of the spellcheck.q parameter

Building the dictionary from its source
Issuing spellcheck requests
Example usage for a misspelled query

Query complete / suggest

Query term completion via facet.prefix
Query term completion via the Suggester
Query term completion via the Terms component

The QueryElevation component

Configuration

The MoreLikeThis component

Configuration parameters

Parameters specific to the MLT search component
Parameters specific to the MLT request handler
Common MLT parameters

MLT results example

The Stats component

Configuring the stats component
Statistics on track durations

The Clustering component
Result grouping/Field collapsing

Configuring result grouping

The TermVector component
Summary

Chapter 8: Deployment

Deployment methodology for Solr

Questions to ask

Installing Solr into a Servlet container

Differences between Servlet containers

Defining solr.home property

Logging

HTTP server request access logs
Solr application logging

Configuring logging output
Logging using Log4j
Jetty startup integration
Managing log levels at runtime

A SearchHandler per search interface?
Leveraging Solr cores

Configuring solr.xml

Property substitution
Include fragments of XML with XInclude

Managing cores
Why use multicore?

Monitoring Solr performance

Stats.jsp
JMX

Starting Solr with JMX

Securing Solr from prying eyes

Limiting server access

Securing public searches
Controlling JMX access

Securing index data

Controlling document access
Other things to look at

Summary

Chapter 9: Integrating Solr

Working with included examples

Inventory of examples

Solritas, the integrated search UI

Pros and Cons of Solritas

SolrJ: Simple Java interface

Using Heritrix to download artist pages
SolrJ-based client for Indexing HTML
SolrJ client API

Embedding Solr
Searching with SolrJ
Indexing

When should I use embedded Solr?

In-process indexing
Standalone desktop applications
Upgrading from legacy Lucene

Using JavaScript with Solr

Wait, what about security?
Building a Solr powered artists autocomplete widget with jQuery and JSONP
AJAX Solr

Using XSLT to expose Solr via OpenSearch

OpenSearch based Browse plugin

Installing the Search MBArtists plugin

Accessing Solr from PHP applications

solr-php-client
Drupal options

Apache Solr Search integration module
Hosted Solr by Acquia

Ruby on Rails integrations

The Ruby query response writer
sunspot_rails gem

Setting up MyFaves project
Populating MyFaves relational database from Solr
Build Solr indexes from a relational database
Complete MyFaves website

Which Rails/Ruby library should I use?

Nutch for crawling web pages
Maintaining document security with ManifoldCF

Connectors
Putting ManifoldCF to use

Summary

Chapter 10: Scaling Solr

Tuning complex systems
Testing Solr performance with SolrMeter
Optimizing a single Solr server (Scale up)

Configuring JVM settings to improve memory usage

MMapDirectoryFactory to leverage additional virtual memory

Enabling downstream HTTP caching
Solr caching

Tuning caches

Indexing performance

Designing the schema
Sending data to Solr in bulk
Don't overlap commits
Disabling unique key checking
Index optimization factors

Enhancing faceting performance
Using term vectors
Improving phrase search performance

Moving to multiple Solr servers (Scale horizontally)

Replication
Starting multiple Solr servers

Configuring replication

Load balancing searches across slaves

Indexing into the master server
Configuring slaves

Configuring load balancing
Sharding indexes

Assigning documents to shards
Searching across shards (distributed search)

Combining replication and sharding (Scale deep)

Near real time search

Where next for scaling Solr?
Summary

Appendix: Search Quick Reference

Quick reference

Index

David Smiley

Born to code, David Smiley is a senior software engineer, book author, conference speaker, and instructor. He has 12 years of experience in the defense industry at MITRE, specializing in Java and Web technologies. David is the principal author of "Solr 1.4 Enterprise Search Server", the first book on Solr, published by PACKT in 2009. He also developed and taught a two-day course on Solr for MITRE. David plays a lead technical role in a large-scale Solr project in which he has implemented geospatial search based on geohash prefixes, wildcard ngram query parsing, searching multiple multi-valued fields at coordinated positions, and part-of-speech search using Lucene payloads, among other features. David consults as a Solr expert on numerous projects for MITRE and its government sponsors. He has contributed code to Lucene and Solr and is active in the open-source community. Before his Solr work, David first used Lucene in 2000 and has since used Hibernate-Search and Compass as well. He has also used the competing commercial product Endeca, but hopes never to use it again.

Eric Pugh

Fascinated by the 'craft' of software development, Eric Pugh has been heavily involved in the open source world as a developer, committer, and user for the past five years. He is an emeritus member of the Apache Software Foundation and lately has been mulling over how to solve the problem of finding answers in datasets when we don’t know ahead of time which questions to ask. In biotech, financial services, and defense IT, he has helped European and American companies develop coherent strategies for embracing open source search software. As a speaker, he has advocated the advantages of Agile practices with a focus on testing in search engine implementation. Eric became involved with Solr when he submitted the patch SOLR-284 for parsing rich document types such as PDF and MS Office formats, which became the single most popular patch as measured by votes. The patch was subsequently cleaned up and enhanced by three other individuals, demonstrating the power of the open source model to build great code collaboratively. SOLR-284 was eventually refactored into Solr Cell as part of Solr version 1.4. He blogs at http://www.opensourceconnections.com/.

Code Downloads

Download the code and support files for this book.

Errata

- 7 submitted: last submission 11 Apr 2013

Errata type: Typo | Errata date: 17th May 12

Page no.: 26

Check the second line in the second paragraph. "At the moment, we're just going to take a peak at the request handlers, which are defined with <requestHandler> elements."

Here, the word "peak" is wrong. Instead, it should be "peek".

Errata type: Typo | Errata date: 24th December 12

Check the instances of the word "solconfig.xml":

Page 114: Check the sentence "This XML is used for most of the response XML and it is also used in parts of solconfig.xml too."

Page 204: Check the sentence "Choose the snippet fragmenting algorithm. This parameter refers to a named <fragmenter/> element in <highlighting/> in solconfig.xml. gap is the default typical choice based on a fragment size."

Page 204: Check the sentence "This parameter refers to a named <formatter/> element in <highlighting/> in solconfig.xml."

Page 204: Check the sentence "This is a reference to a named <encoder/> element in <highlighting/> in solconfig.xml."

Page 206: Check the sentence "This parameter refers to a named <fragListBuilder/> element in <highlighting/> in solconfig.xml."

Page 206: Check the sentence "This parameter refers to a named <fragmentsBuilder/> element in <highlighting/> in solconfig.xml."

All these instances are wrong. They should be replaced with "solrconfig.xml".

Errata type: Typo | Errata date: 24th December 12

Page 164: Check the sentence "With luck, you may find some websites that will suffice, perhaps http://www.wolframapha.com."

This link is wrong. It should be "http://www.wolframalpha.com". (alpha instead of apha)

Errata type: Typo | Errata date: 24th December 12

Check for the instances of the word "XHMTL":

Page 103: Check the sentence "To return only the metadata, and discard all the body content of the XHMTL you would use xpath=/xhtml:html/ xhtml:head/descendant:node()."

Page 104: Check the sentence "Defaults to xml to produce the XHMTL structure."

Page 108: Check the sentence "This returns an XHMTL document that contains the metadata extracted from the document in the <head/> stanza, as well as the basic structure of the contents expressed as XHTML."

All these instances are wrong. They should be replaced with "XHTML".

Errata type: Typo | Errata date: 21st February 12

Page no. 347: Check the line of code <filter class="solr.CommonGramsFilterFactory" words="commongrams.txt" ignoreCase="true"/>"/>

There's an extra "/> at the end of this line. It should be removed.

Errata type: Typo | Errata date: 2nd March 12

Page no. 250:

The last paragraph discusses the HTTP server request log format and states that the last number in a log line is the time in milliseconds taken to serve the request. This is not correct: the number is the count of bytes sent in the response.
While it may be true that Jetty compiles JSP pages on first request, the number "3816" refers to the size of the admin page's HTML sent over the wire. (Otherwise, subsequent requests would show a smaller number.)
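
To illustrate the correction above, here is a minimal Python sketch that splits a request log line in the NCSA common log format and reads the final field as the response size in bytes. The sample line, the regular expression, and the function name `parse_access_log_line` are assumptions made for illustration; only the value 3816 comes from the erratum, and the exact layout of a given Jetty log depends on how the request log is configured.

```python
import re

# NCSA common log format:
#   host ident user [timestamp] "request" status bytes
# The final field is the response size in bytes, not a duration.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_access_log_line(line):
    """Return the fields of one common-log-format line as a dict."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("not a common-log-format line: %r" % line)
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" in the size field means no body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Hypothetical log line; only the trailing 3816 is taken from the erratum.
SAMPLE_LINE = (
    '127.0.0.1 - - [25/Feb/2012:22:57:14 +0000] '
    '"GET /solr/admin/ HTTP/1.1" 200 3816'
)

if __name__ == "__main__":
    entry = parse_access_log_line(SAMPLE_LINE)
    print(entry["request"], "->", entry["size"], "bytes")
```

If request latency is needed, it has to be enabled or logged separately; it is not part of the basic format sketched here.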

Errata type: Typo | Errata date: 25th April 12

Please check page no. 2 : "What you need for this book" section. First bullet point-"Java 6, a JDK release. Do not use Java 7."

There was a problem with the initial release of Java 7, but the 7u1 release addressed it, so Java 7 is now safe to use. The advice on page 2 not to use Java 7 is therefore outdated.

Sample chapters

You can view our sample chapters and prefaces of this title on PacktLib or download sample chapters in PDF format.

What you will learn from this book

Design a schema to include text indexing details like tokenization, stemming, and synonyms
Import data using various formats like CSV, XML, and from databases, and extract text from common document formats
Search using Solr’s rich query syntax, perform geospatial searches, and influence relevancy order
Enhance search results with faceting, query spell-checking, auto-completing queries, highlighted search results, and more
Integrate a host of technologies with Solr, from server-side code to client-side JavaScript to frameworks like Drupal
Scale Solr by learning how to tune it and how to use replication and sharding

In Detail

If you are a developer building an app today then you know how important a good search experience is. Apache Solr, built on Apache Lucene, is a wildly popular open source enterprise search server that easily delivers powerful search and faceted navigation features that are elusive with databases. Solr supports complex search criteria, faceting, result highlighting, query-completion, query spell-check, relevancy tuning, and more.

Apache Solr 3 Enterprise Search Server is a comprehensive reference guide for every feature Solr has to offer. It serves the reader right from initiation to development to deployment. It also comes with complete running examples to demonstrate its use and show how to integrate Solr with other languages and frameworks.

Using a large set of metadata about artists, releases, and tracks, courtesy of the MusicBrainz.org project, you will have a testing ground for Solr and will learn how to import this data in various ways. You will then learn how to search this data in different ways, including with Solr's rich query syntax and by "boosting" match scores based on record data.
Finally, we'll cover various deployment considerations, including indexing strategies and performance-oriented configuration, that will enable you to scale Solr to meet the needs of a high-volume site.

Approach

The book is written as a reference guide. It includes fully working examples based on a real-world public data set.

Who this book is for

This book is for developers who want to learn how to use Apache Solr in their applications. Only basic programming skills are needed.
