To use this library you first need to create a blob storage account on http://azure.com.
Create an App.config or Web.config file and add your account settings to it:
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <!-- Azure settings -->
    <add key="blobStorage" value="DefaultEndpointsProtocol=[http|https];AccountName=myAccountName;AccountKey=myAccountKey" />
  </appSettings>
</configuration>
Adding documents to a catalog is as simple as:
// Fall back to development storage when the "blobStorage" setting is missing or cannot be parsed.
CloudStorageAccount cloudStorageAccount;
if (!CloudStorageAccount.TryParse(CloudConfigurationManager.GetSetting("blobStorage"), out cloudStorageAccount))
    cloudStorageAccount = CloudStorageAccount.DevelopmentStorageAccount;
AzureDirectory azureDirectory = new AzureDirectory(cloudStorageAccount, "TestCatalog");
// The third argument (true) creates a new index, replacing any existing one.
IndexWriter indexWriter = new IndexWriter(azureDirectory, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_CURRENT), true, new Lucene.Net.Index.IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH));
Document doc = new Document();
doc.Add(new Field("id", DateTime.Now.ToFileTimeUtc().ToString(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
doc.Add(new Field("Title", "this is my title", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
doc.Add(new Field("Body", "This is my body", Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
// Add the document to the index, then close the writer so the segment is written out.
indexWriter.AddDocument(doc);
indexWriter.Close();
And searching is as easy as:
IndexSearcher searcher = new IndexSearcher(azureDirectory);
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser("Title", new StandardAnalyzer());
Lucene.Net.Search.Query query = parser.Parse("Title:(Dog AND Cat)");
Hits hits = searcher.Search(query);
for (int i = 0; i < hits.Length(); i++)
{
    // Retrieve the stored fields of the i-th matching document.
    Document doc = hits.Doc(i);
}
Caching and Compression
AzureDirectory compresses blobs before they are sent to blob storage. Blobs are automatically cached locally to avoid round trips for blobs which haven't changed.
By default AzureDirectory stores this local cache in a temporary folder. You can easily control where the local cache is stored by passing in a Directory object for whatever type and location of storage you want.
This example stores the cache in a RAMDirectory:
AzureDirectory azureDirectory = new AzureDirectory("MyIndex", new RAMDirectory());
And this example stores the cache on the file system in C:\myindex:
AzureDirectory azureDirectory = new AzureDirectory("MyIndex", FSDirectory.GetDirectory(@"c:\myindex"));
Notes on settings
Just like a normal Lucene index, calling Optimize() too often causes a lot of churn, and not calling it often enough causes too many segment files to be created, so call it "just enough" times. What that means depends entirely on your application and on your pattern of adding and updating items (which is why Lucene provides so many knobs to configure its behavior).
By default Lucene uses compound files to reduce the number of files it generates. This means it regularly merges and deletes files, which causes churn on the blob storage. Calling indexWriter.SetUseCompoundFile(false) will give better performance.
We run it with a RAMDirectory for the local cache and SetUseCompoundFile(false).
The version of Lucene.NET checked in as a binary is Version 2.3.1, but you can use any version of Lucene.NET you want by simply enlisting from the above open source site.
How does this relate to Azure Tables?
Lucene doesn’t have any concept of tables. Lucene builds its own property store on top of the Directory storage abstraction, which essentially provides both query and storage, so it replicates the functionality of tables. You have to question the benefit of having tables in this case.
With LinqToLucene you can have Linq and strongly typed objects just like table storage. Ultimately, Table storage is just an abstraction on top of blob storage, and so is Lucene (a table abstraction on top of blob storage).
Stated another way, just about anything you can build on table storage you can build on lucene storage.
If it is important that you have table storage as well as a Lucene index, then any time you create a table entity you simply add that entity to Lucene as a document as well (either by a simple hand mapping or by reflection over Linq To Lucene annotations). Queries can then be run against Lucene, and properties retrieved from table storage or from Lucene.
But if you think about it you are duplicating your data then and not really getting much benefit.
There is one benefit to table storage, and that is as an archive of the state of your data. If for some reason you need to rebuild your index you can simply reconstitute it from table storage, but that’s probably the only time you would use the table storage.
How does this perform?
Lucene is capable of complex searches over millions of records in sub-second times, depending on how it is configured.
See http://lucene.apache.org/java/232/benchmarks.html for lots of details about Lucene in general.
But really this is a totally open ended question. It depends on:
- the amount of data
- the frequency of updates
- the kind of schema
Like any flexible system you can configure it to be supremely performant or supremely unperformant.
The key to getting good performance is for you to understand how Lucene works.
Lucene performs efficient incremental indexing by always appending data into files called segments. Periodically it will merge smaller segments into larger segments (a merge). The important thing to know is that it will NEVER modify an old segment, but instead will create new segments and then delete old segments when they are no longer in use.
Lucene is built on top of an abstract storage class called a "Directory" object, and the Azure Library creates an implementation of that class called "AzureDirectory". The directory contract basically provides:
- the ability to enumerate segments
- the ability to delete segments
- a stream for writing a file
- a stream for reading a file
Existing Directory objects in Lucene are:
- RAMDirectory -- an in-memory directory implementation
- FSDirectory -- a disk backed directory implementation
The AzureDirectory class implements the Directory contract as a wrapper around another Directory class which it uses as a local cache.
When Lucene asks to enumerate segments, AzureDirectory enumerates the segments in blob storage.
When Lucene asks to delete a segment, the AzureDirectory deletes the local cache segment and the blob in blob storage.
When Lucene asks for a read stream for a segment (remember segments never change after being closed), AzureDirectory looks to see if it is in the local cache Directory, and if it is, simply returns the local cache stream for that segment. Otherwise it fetches the segment from blob storage, stores it in the local cache Directory and then returns the local cache stream for that segment.
When Lucene asks for a write stream for a segment it returns a wrapper around the stream in the local Directory cache, and on close it pushes the data up to a blob in blob storage.
The net result is that:
- All read operations are performed against the local cache Directory object (which, if it is a RAMDirectory, is near instantaneous).
- Any time a segment is missing from the local cache you incur the cost of downloading the segment once.
- All write operations are performed against the local cache Directory object until the segment is closed, at which point you incur the cost of uploading the segment.
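The read and write paths above can be sketched without any Lucene or Azure types. This is only an illustrative model and not the AzureDirectory source: FakeBlobStore and CachedSegmentStore are hypothetical names, segments are plain byte arrays, and the single WriteAndClose call stands in for writing to a stream and closing it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for blob storage; it counts round trips so the cost is visible.
public class FakeBlobStore
{
    private readonly Dictionary<string, byte[]> blobs = new Dictionary<string, byte[]>();
    public int Downloads { get; private set; }
    public int Uploads { get; private set; }

    public byte[] Download(string name) { Downloads++; return blobs[name]; }
    public void Upload(string name, byte[] data) { Uploads++; blobs[name] = data; }
    public void Delete(string name) { blobs.Remove(name); }
    public List<string> List() { return new List<string>(blobs.Keys); }
}

// Illustrative model of the caching behavior described above (an assumption, not library code).
public class CachedSegmentStore
{
    private readonly FakeBlobStore remote;
    private readonly Dictionary<string, byte[]> localCache = new Dictionary<string, byte[]>();

    public CachedSegmentStore(FakeBlobStore remote) { this.remote = remote; }

    // Enumeration reflects what is in blob storage, not what happens to be cached.
    public List<string> ListSegments() { return remote.List(); }

    // Read: serve from the local cache, downloading at most once per segment.
    public byte[] Read(string name)
    {
        byte[] data;
        if (!localCache.TryGetValue(name, out data))
        {
            data = remote.Download(name);
            localCache[name] = data;
        }
        return data;
    }

    // Write: the segment lands in the local cache and is uploaded when it is closed.
    public void WriteAndClose(string name, byte[] data)
    {
        localCache[name] = data;
        remote.Upload(name, data);
    }

    // Delete: remove both the cached copy and the blob.
    public void Delete(string name)
    {
        localCache.Remove(name);
        remote.Delete(name);
    }
}
```

Creating two CachedSegmentStore instances over one FakeBlobStore models an indexer and a searcher: the indexer pays one upload per segment, the searcher pays one download per segment, and repeated reads cost nothing further.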
The key piece to understand is that the number of transactions you have to perform against blob storage depends on the Lucene settings that control how many segments accumulate before they are merged into a bigger segment (mergeFactor). Calling Optimize() is a really bad idea because it causes ALL SEGMENTS to be merged into ONE SEGMENT, essentially causing the entire index to be recreated, uploaded to blob storage and downloaded to all consumers.
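To get a feel for the difference, here is a deliberately simplified cost model. It assumes every flush produces a segment of one "unit", that whenever mergeFactor segments of the same size exist they are merged into one segment at the next level, and that every new segment (flushed or merged) is uploaded exactly once. Lucene's real merge policy is more nuanced, and MergeCostModel is a hypothetical name, so treat the numbers as an illustration only.

```csharp
using System;
using System.Collections.Generic;

public static class MergeCostModel
{
    // Returns the total units uploaded to blob storage after `flushes` one-unit flushes.
    public static long UploadedUnits(int flushes, int mergeFactor, bool optimizeEveryFlush)
    {
        if (mergeFactor < 2) throw new ArgumentOutOfRangeException("mergeFactor");

        var segmentsAtLevel = new List<int> { 0 }; // segment count per merge level
        long uploaded = 0;

        for (int flush = 1; flush <= flushes; flush++)
        {
            uploaded += 1; // the newly flushed segment is uploaded

            if (optimizeEveryFlush)
            {
                // Everything is rewritten into one segment holding all `flush` units.
                if (flush > 1) uploaded += flush;
                continue;
            }

            segmentsAtLevel[0]++;
            long mergedSize = 1;
            // Cascade: a full level is merged into one segment on the next level up.
            for (int level = 0; segmentsAtLevel[level] == mergeFactor; level++)
            {
                segmentsAtLevel[level] = 0;
                if (level + 1 == segmentsAtLevel.Count) segmentsAtLevel.Add(0);
                segmentsAtLevel[level + 1]++;
                mergedSize *= mergeFactor;
                uploaded += mergedSize; // the merged segment is uploaded
            }
        }
        return uploaded;
    }
}
```

Under these assumptions, 100 flushes with a merge factor of 10 upload 300 units in total, while optimizing after every flush uploads 5149 units, and that is before counting the downloads each consumer has to make.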
The other big factor is how often you create your searcher objects. When you create a Lucene Searcher object it essentially binds to the view of the index at that point in time. Regardless of how many updates are made to the index by other processes, the searcher object will have a static view of the index in its local cache Directory object. If you want to update the view of the searcher, you simply discard the old one and create a new one, and it will again be up to date with the current state of the index.
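One way to control how often a searcher is recreated is to hand it out through a small holder that rebuilds it once it is older than a chosen age. The sketch below is generic and does not reference Lucene; RefreshingHolder is a hypothetical helper, and the injectable clock exists only to make the behavior easy to test.

```csharp
using System;

// Hypothetical helper: returns a cached instance and rebuilds it once it is older than maxAge.
public class RefreshingHolder<T> where T : class
{
    private readonly Func<T> factory;
    private readonly TimeSpan maxAge;
    private readonly Func<DateTime> clock;
    private readonly object gate = new object();
    private T current;
    private DateTime createdAt;

    public RefreshingHolder(Func<T> factory, TimeSpan maxAge, Func<DateTime> clock = null)
    {
        this.factory = factory;
        this.maxAge = maxAge;
        this.clock = clock ?? (() => DateTime.UtcNow);
    }

    public T Get()
    {
        lock (gate)
        {
            DateTime now = clock();
            // Build on first use, or replace the instance once it has become stale.
            if (current == null || now - createdAt >= maxAge)
            {
                current = factory();
                createdAt = now;
            }
            return current;
        }
    }
}
```

With Lucene this might be used as new RefreshingHolder<IndexSearcher>(() => new IndexSearcher(azureDirectory), TimeSpan.FromMinutes(1)). Note that the sketch does not close the searcher it replaces; a real implementation would need to do that once in-flight queries have finished.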
If you control those factors, you can have a super scalable fast system which can handle millions of records and thousands of queries per second no problem.
What is the best way to build an Azure application around this?
Of course that depends on your data flow, etc. but in general here is an example architecture that works well:
The index can only be updated by one process at a time, so it makes sense to push all Add/Update/Delete operations through an indexing role. The obvious way to do that is to have an Azure queue which feeds a stream of objects to be indexed to a worker role which keeps the index up to date.
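The single-writer pattern can be sketched in-process as follows. This is a stand-in under stated assumptions: a BlockingCollection plays the part of the Azure queue, a Dictionary plays the part of the index, and IndexOp, IndexRequest and IndexingWorker are hypothetical names. In a real worker role the loop body would call into an IndexWriter instead.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public enum IndexOp { AddOrUpdate, Delete }

public class IndexRequest
{
    public IndexOp Op;
    public string Id;
    public string Body;
}

// Any number of producers may enqueue; exactly one consumer task applies the changes.
public class IndexingWorker
{
    private readonly BlockingCollection<IndexRequest> queue = new BlockingCollection<IndexRequest>();
    private readonly Dictionary<string, string> index = new Dictionary<string, string>();
    private readonly Task worker;

    public IndexingWorker()
    {
        worker = Task.Run(() =>
        {
            // Requests are applied one at a time, in the order they were queued.
            foreach (IndexRequest request in queue.GetConsumingEnumerable())
            {
                if (request.Op == IndexOp.Delete) index.Remove(request.Id);
                else index[request.Id] = request.Body;
            }
        });
    }

    public void Enqueue(IndexRequest request) { queue.Add(request); }

    // Stops accepting work, waits for the queue to drain and returns the final state.
    public IDictionary<string, string> Drain()
    {
        queue.CompleteAdding();
        worker.Wait();
        return index;
    }
}
```

Because only the worker task touches the index, producers never contend for a write lock; they only pay the cost of adding a message to the queue.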
On the search side, you can have a searcher webrole which simply creates an AzureDirectory, using a RAMDirectory as its cache, pointed at the blob storage that the indexing role maintains. As appropriate (say once a minute) the searcher webrole creates a new IndexSearcher object around the index, and any changes are automatically synced into the cache directory on the searcher webrole.
To scale your search engine you can simply increase the instance count of the searcher webrole to handle the load.