Tuesday 21 August 2012

Amazon CloudSearch - Start Searching in One Hour for Less Than $100 / Month


An extract from Amazon Web Services Evangelist Jeff Barr's CloudSearch blog post, explaining how you can start searching in an hour for less than $100 a month...

Continuing along in our quest to give you the tools that you need to build ridiculously powerful web sites and applications in no time flat at the lowest possible cost, I'd like to introduce you to Amazon CloudSearch. If you have ever searched Amazon.com, you've already used the technology that underlies CloudSearch. You can now have a very powerful and scalable search system (indexing and retrieval) up and running in less than an hour.

You, sitting in your corporate cubicle, your coffee shop, or your dorm room, now have access to search technology at a very affordable price. You can start to take advantage of many years of Amazon R&D in the search space for just $0.12 per hour (I'll talk about pricing in depth later).


What is Search?

Search plays a major role in many web sites and other types of online applications. The basic model is seemingly simple. Think of your set of documents or your data collection as a book or a catalog, composed of a number of pages. You know that you can find the desired content quickly and efficiently by simply consulting the index.

Search does the same thing by indexing each document in a way that facilitates rapid retrieval. You enter some terms into a search box and the site responds (rather quickly if you use CloudSearch) with a list of pages that match the search terms.

As is the case with many things, this simple model masks a lot of complexity and might raise a lot of questions in your mind. For example:
  1. How efficient is the search? Did the search engine simply iterate through every page, looking for matches, or is there some sort of index?
  2. The search results were returned in the form of an ordered list. What factor(s) determined which documents were returned, and in what order (commonly known as ranking)? How are the results grouped?
  3. How forgiving or expansive was the search? Did a search for "dogs" return results for "dog?" Did it return results for "golden retriever," or "pet?"
  4. What kinds of complex searches or queries can be used? Does a search for "dog training" return the expected results? Can you search for "dog" in the Title field and "training" in the Description?
  5. How scalable is the search? What if there are millions or billions of pages? What if there are thousands of searches per hour? Is there enough storage space?
  6. What happens when new pages are added to the collection, or old pages are removed? How does this affect the search results?
  7. How can you efficiently navigate through and explore search results? Can you group and filter the search results in ways that take advantage of multiple named fields (often known as faceted search)?
Needless to say, things can get very complex very quickly. Even if you can write code to do some or all of this yourself, you still need to worry about the operational aspects. We know that scaling a search system is non-trivial. There are lots of moving parts, all of which must be designed, implemented, instantiated, scaled, monitored, and maintained. As you scale, algorithmic complexity often comes into play; you soon learn that algorithms and techniques which were practical at the beginning aren't always practical at scale.


What is Amazon CloudSearch?

Amazon CloudSearch is a fully managed search service in the cloud. You can set it up and start processing queries in less than an hour, with automatic scaling for data and search traffic, all for less than $100 per month.

CloudSearch hides all of the complexity and all of the search infrastructure from you. You simply provide it with a set of documents and decide how you would like to incorporate search into your application.

You don't have to write your own indexing, query parsing, query processing, results handling, or any of that other stuff. You don't need to worry about running out of disk space or processing power, and you don't need to keep rewriting your code to add more features.

With CloudSearch, you can focus on your application layer. You upload your documents, CloudSearch indexes them, and you can build a search experience that is custom-tailored to the needs of your customers.


How Does it Work?

The Amazon CloudSearch model is really simple, but don't confuse simple with simplistic -- there's a lot going on behind the scenes!

Here's all you need to do to get started (you can perform these operations from the AWS Management Console, with the CloudSearch command line tools, or through the CloudSearch APIs):
  1. Create and configure a Search Domain. This is a data container and a related set of services. It exists within a particular Availability Zone of a single AWS Region (initially US East).
  2. Upload your documents. Documents can be uploaded as JSON or XML that conforms to our Search Document Format (SDF). Uploaded documents will typically be searchable within seconds. You can, if you'd like, send data over an HTTPS connection to protect it while it is in transit (see the sketch after this list).
  3. Perform searches.
There are plenty of options and goodies, but that's all it takes to get started.
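
As a concrete sketch of step 2 (the endpoint, domain name, and field names below are illustrative placeholders, not real values), uploading a small SDF batch over HTTPS from Python might look like this:

    import json
    import requests

    # Illustrative document service endpoint -- the console shows the real one
    # after you create your search domain.
    DOC_ENDPOINT = "https://doc-mydomain-abc123.us-east-1.cloudsearch.amazonaws.com"

    # An SDF batch: each entry adds (or deletes) a single document.
    batch = [{
        "type": "add",
        "id": "movie_1",
        "version": 1,
        "lang": "en",
        "fields": {"title": "Dog Training Basics", "genre": "documentary"},
    }]

    response = requests.post(
        DOC_ENDPOINT + "/2011-02-01/documents/batch",
        data=json.dumps(batch),
        headers={"Content-Type": "application/json"},
    )
    print(response.status_code, response.text)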

Amazon CloudSearch applies data updates continuously, so newly changed data becomes searchable in near real-time. Your index is stored in RAM to keep throughput high and to speed up document updates. You can also tell CloudSearch to re-index your documents; you'll need to do this after changing certain configuration options, such as stemming (converting variations of a word to a base word, such as "dogs" to "dog") or stop words (very common words that you don't want to index).
Amazon CloudSearch has a number of advanced search capabilities including faceting and fielded search:

Faceting allows you to categorize your results into sub-groups, which can be used as the basis for another search. You could search for "umbrellas" and use a facet to group the results by price, such as $1-$10, $10-$20, $20-$50, and so forth. CloudSearch will even return document counts for each sub-group.
Fielded searching allows you to search on a particular attribute of a document. You could locate movies in a particular genre or with a particular actor, or products within a certain price range.
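
As a sketch of how these capabilities surface in queries (the endpoint and the "price" facet field are illustrative placeholders, assuming the hypothetical domain from earlier), a faceted search might look like this in Python:

    import requests

    # Illustrative search service endpoint for a hypothetical domain.
    SEARCH_ENDPOINT = "https://search-mydomain-abc123.us-east-1.cloudsearch.amazonaws.com"

    # Free-text query for "umbrellas", asking CloudSearch to facet on price.
    params = {
        "q": "umbrellas",
        "facet": "price",
        "return-fields": "title,price",
    }
    results = requests.get(SEARCH_ENDPOINT + "/2011-02-01/search", params=params).json()
    print(results["hits"]["found"])    # total number of matching documents
    print(results.get("facets"))       # per-facet document counts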

Search Scaling
Behind the scenes, CloudSearch stores data and processes searches using search instances. Each instance has a finite amount of CPU power and RAM. As your data expands, CloudSearch will automatically launch additional search instances and/or scale to larger instance types. As your search traffic expands beyond the capacity of a single instance, CloudSearch will automatically launch additional instances and replicate the data to the new instance. If you have a lot of data and a high request rate, CloudSearch will automatically scale in both dimensions for you.

Amazon CloudSearch will automatically scale your search fleet up to a maximum of 50 search instances. We'll be increasing this limit over time; if you have an immediate need for more than 50 instances, please feel free to contact us and we'll be happy to help.

The net-net of all of this automation is that you don't need to worry about having enough storage capacity or processing power. CloudSearch will take care of it for you, and you'll pay only for what you use.

Pricing Model

The Amazon CloudSearch pricing model is straightforward:

You'll be billed based on the number of running search instances. There are three search instance sizes (Small, Large, and Extra Large) at prices ranging from $0.12 to $0.68 per hour (these are US East Region prices, since that's where we are launching CloudSearch).
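
To put the instance pricing in perspective, a single small search instance running around the clock works out to roughly $0.12 × 24 hours × 30 days ≈ $86 per month, which is how a basic search domain comes in under the $100 / month mark.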

There's a modest charge for each batch of uploaded data. If you change configuration options and need to re-index your data, you will be billed $0.98 for each gigabyte of data in the search domain.

There's no charge for in-bound data transfer; data transfer out is billed at the usual AWS rates, and you can transfer data to and from your Amazon EC2 instances in the same Region at no charge.

Advanced Searching

Like the other Amazon Web Services, CloudSearch allows you to get started with a modest effort and to add richness and complexity over time. You can easily implement advanced features such as faceted search, free text search, Boolean search expressions, customized relevance ranking, field-based sorting and searching, and text processing options such as stopwords, synonyms, and stemming.
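
For example, a Boolean, fielded expression is passed in the bq parameter of a search request; a hedged sketch (the field names are illustrative) might look like:

    bq=(and genre:'documentary' (or title:'dog' title:'puppy'))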

CloudSearch Programming

You can interact with CloudSearch through the AWS Management Console, a complete set of Amazon CloudSearch APIs, and a set of command line tools. You can easily create, configure, and populate a search domain through the AWS Management Console.
Here's a tour, starting with the welcome screen:

[Screenshot: the CloudSearch welcome screen]
You start by creating a new Search Domain:

[Screenshot: creating a new Search Domain]
You can then load some sample data. It can come from local files, an Amazon S3 bucket, or several other sources:

[Screenshot: loading sample data]
Here's how you choose an S3 bucket (and an optional prefix to limit which documents will be indexed):

[Screenshot: choosing an S3 bucket]
You can also configure your initial set of index fields:

[Screenshot: configuring index fields]
You can also create access policies for the CloudSearch APIs:

[Screenshot: configuring access policies]
Your search domain will be initialized and ready to use within twenty minutes:

[Screenshot: search domain initialization]
Processing your documents is the final step in the initialization process:

[Screenshot: document processing]
After your documents have been processed you can perform some test searches from the console:

[Screenshot: test searches in the console]
The CloudSearch console also provides you with full control over a number of indexing options including stopwords, stemming, and synonyms:

[Screenshot: indexing options]

CloudSearch in Action
Some of our early customers have already deployed applications powered by CloudSearch. Here's a sampling:
  • Search Technologies has used CloudSearch to index Wikipedia (see the demo).
  • NewsRight is using CloudSearch to deliver search for news content, usage and rights information to over 1,000 publications.
  • ex.fm is using CloudSearch to power their social music discovery website.
  • CarDomain is powering search on their social networking website for car enthusiasts.
  • Sage Bionetworks is powering search on their data-driven collaborative biological research website.
  • SmugMug is using CloudSearch to deliver search on their website for over a billion photos.

SOURCE

    AWS Direct Connect - New Locations and Console Support

    On 13th August, AWS announced new locations and console support for AWS Direct Connect. A great article by Jeff...

    Did you know that you can use AWS Direct Connect to set up a dedicated 1 Gbps or 10 Gbps network connection from your existing data center or corporate office to AWS?

    New Locations

    Today we are adding two additional Direct Connect locations so that you have even more ways to reduce your network costs and increase network bandwidth throughput. You also have the potential for a more consistent experience. Here is the complete list of locations:
    If you have your own equipment running at one of the locations listed above, you can use Direct Connect to optimize the connection to AWS. If your equipment is located somewhere else, you can work with one of our APN Partners supporting Direct Connect to establish a connection from your location to a Direct Connect location, and from there on to AWS.

    Console Support

    Up until now, you needed to fill in a web form to initiate the process of setting up a connection. To make the process simpler and smoother, you can now start the ordering process and manage your connections through the AWS Management Console.
    Here's a tour. You can establish a new connection by selecting the Direct Connect tab in the console:

    [Screenshot: establishing a new connection]
    After you confirm your choices you can place your order with one final click:

    [Screenshot: placing the order]
    You can see all of your connections in a single (global) list:

    [Screenshot: the global connection list]
    You can inspect the details of each connection:

    [Screenshot: connection details]
    You can then create a Virtual Interface for your connection. The interface can be connected to one of your Virtual Private Clouds, or it can connect to the full set of AWS services:

    [Screenshots: creating a Virtual Interface]
    You can even download a router configuration file tailored to the brand, model, and version of your router:

    [Screenshot: router configuration download]
    Get Connected
    And there you have it! Learn more about AWS Direct Connect and get started today.

    SOURCE

    All about AWS EBS Provisioned IOPS - feature and resources

    AWS recently announced the EBS Provisioned IOPS feature, a new Elastic Block Store volume type for running high-performance databases in the cloud. Provisioned IOPS volumes are designed to deliver predictable, high performance for I/O-intensive workloads, such as database applications, that rely on consistent and fast response times. With Provisioned IOPS, you can flexibly specify both volume size and volume performance, and Amazon EBS will consistently deliver the desired performance over the lifetime of the volume.
    A simple comparison between standard volumes and Provisioned IOPS volumes:

    Amazon EBS Standard volumes
    • Offer cost effective storage for applications with moderate or bursty I/O requirements.
    • Deliver approximately 100 IOPS on average with a best effort ability to burst to hundreds of IOPS.
    • Are also well suited for use as boot volumes, where the burst capability provides fast instance start-up times.
    • $0.10 per GB-month of provisioned storage
    • $0.10 per 1 million I/O requests

    Amazon EBS Provisioned IOPS volumes
    • Provisioned IOPS volumes are designed to deliver predictable, high performance for I/O intensive workloads such as databases.
    • Amazon EBS currently supports up to 1000 IOPS per Provisioned IOPS volume, with higher limits coming soon.
    • Provisioned IOPS volumes are designed to deliver within 10% of the provisioned IOPS performance 99.9% of the time.
    • $0.125 per GB-month of provisioned storage
    • $0.10 per provisioned IOPS-month
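
    To make the Provisioned IOPS pricing concrete, consider a hypothetical 100 GB volume provisioned at 1000 IOPS: 100 GB × $0.125 plus 1000 IOPS × $0.10 comes to roughly $112.50 per month, before any data transfer charges.
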
    AWS has compiled some interesting resources for users:

    Our recent release of the EBS Provisioned IOPS feature (blog post, explanatory video, EBS home page) has met with a very warm reception. Developers all over the world are already making use of this important and powerful new EC2 feature. I would like to make you aware of some other new resources and blog posts to help you get the most from your EBS volumes.
    • The EC2 FAQ includes answers to a number of important performance and architecture questions about Provisioned IOPS.
    • The EC2 API tools have been updated and now support the creation of Provisioned IOPS volumes. The newest version of the ec2-create-volume tool supports the --type and --iops options. For example, the following command will create a 500 GB volume with 1000 Provisioned IOPS:
      $ ec2-create-volume --size 500 --availability-zone us-east-1b --type io1 --iops 1000
    • Eric Hammond has written a detailed migration guide to show you how to convert a running EC2 instance to an EBS-Optimized EC2 instance with Provisioned IOPS volumes. It is a very handy post, and it also shows off the power of programmatic infrastructure.
    • I have been asked about the applicability of existing EC2 Reserved Instances to the new EBS-Optimized instances. Yes, they apply, and you pay only the additional hourly charge. Read our new FAQ entry to learn more.
    • I have also been asked about the availability of EBS-Optimized instances for more instance types. We intend to support other instance types based on demand. Please feel free to let us know what you need by posting a comment on this blog or in the EC2 forum.
    • The folks at CloudVertical have written a guide to understanding new AWS I/O options and costs.
    • The team at Stratalux wrote a very informative blog post, Putting Amazon's Provisioned IOPS to the Test. Their conclusion:
      "Based upon our tests PIOPS definitely provides much needed and much sought after performance improvements over standard EBS volumes. I’m glad to see that Amazon has heeded the calls of its customers and developed a persistent storage solution optimized for database workloads."
    We have also put together a new guide to benchmarking Provisioned IOPS volumes. It shows you how to set up and run high-quality, repeatable benchmarks on Linux and Windows using the fio, Oracle Orion, and SQLIO tools, and it will walk you through the following steps:
    • Launching an EC2 instance.
    • Creating Provisioned IOPS EBS volumes.
    • Attaching the volumes to the instance.
    • Creating a RAID from the volumes.
    • Installing the appropriate benchmark tool.
    • Benchmarking the I/O performance of your volumes.
    • Deleting the volumes and terminating the instance.
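
    As a rough sketch of the kind of run the guide describes (the flags and sizes here are illustrative, not the guide's exact invocation), a random-read fio benchmark against a RAID volume mounted at /mnt/raid might look like this:

      $ sudo fio --directory=/mnt/raid --name=fio_test_file --direct=1 --rw=randread --bs=16k --size=1G --numjobs=32 --time_based --runtime=180 --group_reporting
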
    Since I like to try things for myself, I created six 100 GB volumes, each provisioned for 1000 IOPS:
    [Screenshot: six 100 GB volumes, each provisioned for 1000 IOPS]
    Then I booted up an EBS-Optimized EC2 instance, built a RAID, and ran fio. Here's what I saw in the AWS Management Console's CloudWatch charts after the run. Each volume was delivering 1000 IOPS, as provisioned:
    [Screenshot: CloudWatch charts showing each volume delivering 1000 IOPS]
    Here's an excerpt from the results:
    fio_test_file: (groupid=0, jobs=32): err= 0: pid=23549: Mon Aug 6 14:01:14 2012
    read : io=123240MB, bw=94814KB/s, iops=5925 , runt=1331000msec
    clat (usec): min=356 , max=291546 , avg=5391.52, stdev=8448.68
    lat (usec): min=357 , max=291547 , avg=5392.91, stdev=8448.68
    clat percentiles (usec):
    | 1.00th=[ 418], 5.00th=[ 450], 10.00th=[ 478], 20.00th=[ 548],
    | 30.00th=[ 596], 40.00th=[ 668], 50.00th=[ 892], 60.00th=[ 1160],
    | 70.00th=[ 3152], 80.00th=[10432], 90.00th=[20864], 95.00th=[26752],
    | 99.00th=[29824], 99.50th=[30336], 99.90th=[31360], 99.95th=[31872],
    | 99.99th=[37120]
    Read the benchmarking guide to learn more about running the benchmarks and interpreting the results.

    SOURCE for resources : http://aws.typepad.com/aws/2012/08/ebs-provisioned-iops-some-interesting-resources.html

    Announcing AWS Elastic Beanstalk support for Python, and seamless database integration


    It’s a good day to be a Python developer: AWS Elastic Beanstalk now supports Python applications! If you’re not familiar with Elastic Beanstalk, it’s the easiest way to deploy and manage scalable PHP, Java, .NET, and now Python applications on AWS. You simply upload your application, and Elastic Beanstalk automatically handles all of the details associated with deployment including provisioning of Amazon EC2 instances, load balancing, auto scaling, and application health monitoring.

    Elastic Beanstalk supports Python applications that run on the familiar Apache HTTP Server and WSGI. In other words, you can run any Python application, including your Django and Flask applications. Elastic Beanstalk also supports a rich set of tools to help you develop faster: you can use eb and Git to quickly develop and deploy from the command line, or use the AWS Management Console to manage your application and configuration.
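
    As a sketch of that command-line flow (tool versions vary, so treat the exact commands as illustrative), a first deployment might look like this:

      $ git init && git add -A && git commit -m "v1"
      $ eb init        # choose a region, a Python solution stack, and credentials
      $ eb start       # create the environment and deploy the current branch
      $ git aws.push   # after later commits, push a new version to the environment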

    The Python release brings with it many platform improvements to help you get your application up and running more quickly and securely. Here are a few of the highlights:

    Integration with Amazon RDS

    Amazon RDS makes it easy to set up, operate, and scale a relational database in the cloud, making it a great fit for scalable web applications running on Elastic Beanstalk.

    If your application requires a relational database, Elastic Beanstalk can create an Amazon RDS database instance to use with your application. The RDS database instance is automatically configured to communicate with the Amazon EC2 instances running your application.
    [Screenshot: AWS RDS configuration details, shown when launching a new AWS Elastic Beanstalk environment]

    Once the RDS database instance is provisioned, you can retrieve information about the database from your application using environment variables:



    import os

    # Elastic Beanstalk exposes the RDS connection details to the application
    # as environment variables; use them to configure Django's database.
    if 'RDS_HOSTNAME' in os.environ:
        DATABASES = {
            'default': {
                'ENGINE': 'django.db.backends.mysql',
                'NAME': os.environ['RDS_DB_NAME'],
                'USER': os.environ['RDS_USER'],
                'PASSWORD': os.environ['RDS_PASSWORD'],
                'HOST': os.environ['RDS_HOSTNAME'],
                'PORT': os.environ['RDS_PORT'],
            }
        }


    To learn more about using Amazon RDS with Elastic Beanstalk, visit “Using Amazon RDS with Python” in the Developer Guide.

    Customize your Python Environment
    You can customize the Python runtime for Elastic Beanstalk using a set of declarative text files within your application. If your application contains a requirements.txt file in its top-level directory, Elastic Beanstalk will automatically install the dependencies using pip.
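
    For example, a minimal requirements.txt for a Django application might contain just a few pinned packages (the version numbers here are illustrative):

      Django==1.4.1
      MySQL-python==1.2.3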

    Elastic Beanstalk is also introducing a new configuration mechanism that allows you to install packages from yum, run setup scripts, and set environment variables. You simply create a “.ebextensions” directory inside your application and add a “python.config” file in it. Elastic Beanstalk loads this configuration file and installs the yum packages, runs any scripts, and then sets environment variables. Here is a sample configuration file that syncs the database for a Django application:


    commands:
      syncdb:
        command: "django-admin.py syncdb --noinput"
        leader_only: true

    option_settings:
      "aws:elasticbeanstalk:application:python:environment":
        DJANGO_SETTINGS_MODULE: "mysite.settings"
      "aws:elasticbeanstalk:container:python":
        WSGIPath: "mysite/wsgi.py"


    Snapshot your logs

    To help you debug problems, you can easily take a snapshot of your logs from the AWS Management Console. Elastic Beanstalk aggregates the top 100 lines from many different logs, including the Apache error log, to help you squash those bugs.
    [Screenshot: snapshotting logs from the Elastic Beanstalk console]

    The snapshot is saved to S3 and is automatically deleted after 15 minutes. Elastic Beanstalk can also automatically rotate the log files to Amazon S3 on an hourly basis so you can analyze traffic patterns and identify issues. To learn more, visit “Working with Logs” in the Developer Guide.

    Support for Django and Flask

    Using the customization mechanism above, you can easily deploy and run your Django and Flask applications on Elastic Beanstalk.
    For more information about using Python and Elastic Beanstalk, visit the Developer Guide.