AWS

Setting up a new instance

See also Docker

# install compilers and common development libraries
sudo apt-get update
sudo apt-get install build-essential
sudo apt-get install libssl-dev libcurl4-openssl-dev libxml2-dev libnlopt-dev libhdf5-dev python-pip
# on newer Ubuntu releases the package is python3-pip
pip install --upgrade pip

Git LFS

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install
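
Once installed, Git LFS tracks files by pattern per repository. A minimal sketch (the *.h5 pattern and file name are just examples):

# record the pattern in .gitattributes and commit a large file through LFS
git lfs track "*.h5"
git add .gitattributes data.h5
git commit -m "Add data via Git LFS"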

https://help.github.com/articles/connecting-to-github-with-ssh/

# use -f so the key file matches the IdentityFile entry in ~/.ssh/config below
ssh-keygen -t rsa -b 4096 -C "your_email@example.com" -f ~/.ssh/id_rsa_github
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa_github

In ~/.ssh/config, add (https://superuser.com/questions/232373/how-to-tell-git-which-private-key-to-use):

Host github.com
 HostName github.com
 IdentityFile ~/.ssh/id_rsa_github
 User git
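
To confirm that GitHub accepts the key, test the connection (GitHub prints a greeting and closes the session):

ssh -T git@github.com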

Install awscli

pip install awscli
# prompts for the access key ID, secret access key, default region, and output format
aws configure
# upload the contents of ./logdata to the bucket
aws s3 sync ./logdata s3://bucketname/

Storage

Some background first. Network-attached storage (NAS) operates as an extension of the server's local file system and is used like local storage: reads and writes work the same as though the files were located on the server's hard disk. NAS makes remote storage look like part of the local server.

Storage-area networks (SAN) offer remote storage that is separate from the local server and that storage does not appear as local to the server. Instead, the server must operate a special protocol to communicate with the SAN device; you can say that the SAN device offers detached storage that the server must make special arrangements to use.

Two new storage types are now available:

  • Object: reliably stores and retrieves unstructured digital objects
  • Key-value: manages structured data

Object storage provides the ability to store objects, which are essentially collections of digital bits. Object storage offers reliable (and highly scalable) storage of collections of bits but imposes no structure on them. Any structure is chosen by the user, who needs to know the details of the object, such as its format and how it is manipulated; the object storage service simply provides reliable storage of the bits.

Object storage differs from file storage in that it offers no in-place update functionality. Instead, you update the object locally and then write the whole object back into the object store; writing an object under an existing key replaces the previous version.
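
A minimal sketch of this replace-by-rewrite behaviour using the AWS CLI (bucket and key names are hypothetical):

echo "version 1" > notes.txt
aws s3 cp notes.txt s3://my-bucket/notes.txt
# edit locally, then write the whole object back; the same key now holds the new bytes
echo "version 2" > notes.txt
aws s3 cp notes.txt s3://my-bucket/notes.txt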

Distributed key-value storage provides structured storage that is somewhat like a database but differs in important ways in order to provide additional scalability and performance. Newer distributed key-value storage products are designed to support huge amounts of data by spreading it across many servers. Key-value storage systems often make use of redundant hardware resources to prevent outages; this redundancy keeps the key-value system highly available.

Key-value storage systems have these common characteristics:

  • Data is structured with a single unique key that identifies the record
  • Retrieval is restricted to the key value; finding all records that share an attribute other than the key requires examining every record
  • No support exists for searches across multiple datasets with common data elements, i.e. they do not support joins

Key-value storage represents a trade-off between ease of use and scalability, where scalability is favoured over ease of use.

Simple Storage Service

Simple Storage Service (S3) is one of the richest, most flexible, and most widely used AWS services. It provides highly scalable object storage in the form of unstructured collections of bits and is used in many different applications, for example:

  • Dropbox uses S3 to store the documents it holds on behalf of its users.
  • Netflix operates almost 100 percent on AWS and uses S3 to store videos before they go out to its Content Delivery Network.
  • Medcommons stores customers' health records in S3 and is in compliance with the strict requirements of the Health Insurance Portability and Accountability Act (HIPAA).

S3 objects are treated as web objects: they are accessed via Internet protocols using a URL identifier. Every S3 object has a unique URL, which in the path-style format is:

https://s3.amazonaws.com/bucket/key

Newer buckets are usually addressed in the virtual-hosted style instead: https://bucket.s3.region.amazonaws.com/key.

A bucket in AWS is a group of objects. The bucket name is associated with an account but can otherwise be almost anything; the bucket namespace is completely flat, meaning that every bucket name must be unique among all users of AWS. By default an account is limited to 100 buckets, and bucket names have certain restrictions (https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucketnamingrules.html).

A key in AWS is the name of an object and acts as an identifier to locate the data associated with it. A key can be a simple object name or a more complex arrangement with some structure; in this way, a key can mimic directory-like or URL-like formats for object names, despite not representing the actual structure of the S3 storage system.
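
For example (hypothetical bucket and keys), slash-separated keys behave like paths in tools such as the AWS CLI, even though the namespace within a bucket is flat:

aws s3 cp access.log s3://my-bucket/logs/2022/10/access.log
# list keys sharing the logs/2022/ prefix, displayed like a directory
aws s3 ls s3://my-bucket/logs/2022/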

An S3 object is simply a collection of bytes and the only restriction is that the object is limited to 5TB in size.

S3 can be accessed via an API that supports both SOAP and REST interfaces. In addition to PUT, GET, and DELETE, S3 offers a wide range of object management actions, such as an API call to get the version number of an object. While you can't update an object in place, you can store different versions of an object, as sketched below.
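
A rough sketch with the AWS CLI (bucket and key are hypothetical) that turns on versioning and lists the stored versions of an object:

aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled
# every subsequent write of notes.txt adds a new version
aws s3api list-object-versions --bucket my-bucket --prefix notes.txt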

The Access Control List (ACL) is used to control the accessibility of S3 objects. There are four types of users who can access S3 objects (a CLI sketch follows the list):

  • Owner: the person who created the object has permissions to read or delete the object
  • Specific users or groups: particular users or groups of users within AWS
  • Authenticated users: people who have accounts within AWS and have been successfully authenticated
  • Everyone: anyone on the Internet
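
For example, a canned ACL can be applied to a single object with the CLI (bucket and key are hypothetical; note that newly created buckets now disable ACLs by default):

# make one object readable by everyone on the Internet
aws s3api put-object-acl --bucket my-bucket --key notes.txt --acl public-read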

AWS as a whole is organised into regions, each containing one or more availability zones (AZs). S3 locates buckets within particular regions, but bucket names are unique across all regions. When an AWS VM accesses an S3 object, there is no charge for the network traffic if the VM and the S3 object reside in the same AWS region. CloudFront, the AWS content delivery network, stores a single copy of an object and caches it at edge locations so that it can be served from every region.

S3 has a simple cost structure:

1. You pay per gigabyte of storage used by your objects
2. You pay for API calls to S3
3. You pay for the network traffic generated by the delivery of S3 objects

Transferring data into S3 storage is free and there's no charge for the first gigabyte of outbound traffic. (Might be outdated, please check.)

Elastic Block Storage

Provides highly available and reliable data volumes that can be attached to a virtual machine, detached, and then reattached to another VM.
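
A rough sketch of the volume life cycle with the AWS CLI (zone, IDs, and device name are placeholders; see the mounting steps under Miscellaneous below):

aws ec2 create-volume --availability-zone us-east-1a --size 100 --volume-type gp3
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/sdf
# detach so the volume can be attached to another instance
aws ec2 detach-volume --volume-id vol-0123456789abcdef0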

Glacier

A data archiving solution; provides low-cost, highly robust archival data storage and retrieval.
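
One way to use it from S3 is to write objects straight into a Glacier storage class and restore them before reading (bucket and key are hypothetical):

aws s3 cp archive.tar.gz s3://my-bucket/archive.tar.gz --storage-class GLACIER
# a restore request must complete before the object can be read again
aws s3api restore-object --bucket my-bucket --key archive.tar.gz --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}}'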

DynamoDB

Key-value storage; provides highly scalable, high-performance storage based on tables indexed by data values referred to as keys.

The following tutorial shows how to create a simple table, add data, scan and query the data, delete data, and delete the table using the DynamoDB console:

https://aws.amazon.com/getting-started/tutorials/create-nosql-table/

Getting Started with Amazon DynamoDB - https://aws.amazon.com/dynamodb/getting-started/
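
The same first steps from the command line, as a rough sketch (table name and item are made up):

aws dynamodb create-table --table-name Music \
    --attribute-definitions AttributeName=Artist,AttributeType=S \
    --key-schema AttributeName=Artist,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST
aws dynamodb put-item --table-name Music --item '{"Artist": {"S": "No One You Know"}, "Year": {"N": "2015"}}'
# retrieval is by key only, as described under key-value storage above
aws dynamodb get-item --table-name Music --key '{"Artist": {"S": "No One You Know"}}'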

Installing R

https://www.digitalocean.com/community/tutorials/how-to-install-r-on-ubuntu-16-04-2

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu xenial/'
sudo apt-get update
sudo apt-get install r-base

https://www.digitalocean.com/community/tutorials/how-to-install-r-on-ubuntu-18-04

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/'
sudo apt update
sudo apt install r-base

Installing RStudio Server

See https://www.rstudio.com/products/rstudio/download-server/

sudo apt-get install gdebi-core
wget https://download2.rstudio.org/rstudio-server-1.1.456-amd64.deb
sudo gdebi rstudio-server-1.1.456-amd64.deb
sudo rstudio-server verify-installation

https://support.rstudio.com/hc/en-us/articles/200552316-Configuring-the-Server

Add user for RStudio

# RStudio Server logs users in as local system accounts; useradd -m would also create the home directory
sudo useradd blah
sudo passwd blah
sudo mkdir /home/blah
sudo chown blah:blah /home/blah

Installing R Shiny Server

See https://www.rstudio.com/products/shiny/download-server/ and http://docs.rstudio.com/shiny-server/#getting-started

sudo apt-get install gdebi-core
wget https://download3.rstudio.org/ubuntu-14.04/x86_64/shiny-server-1.5.7.907-amd64.deb
sudo gdebi shiny-server-1.5.7.907-amd64.deb

Start R and install these packages

# global installation of packages
sudo R
install.packages("shiny")
install.packages("rmarkdown")

Starting and stopping

sudo systemctl start shiny-server
sudo systemctl stop shiny-server

Config file in /etc/shiny-server/shiny-server.conf

# run apps as the shiny user and keep rotated logs
run_as shiny;
preserve_logs true;

# timeouts are in seconds
app_init_timeout 1800;
app_idle_timeout 1800;
http_keepalive_timeout 1800;
sockjs_heartbeat_delay 500;
disable_websockets off;

# Define a server that listens on port 3838
server {
  listen 3838;

  # Define a location at the base URL
  location / {

    # Host the directory of Shiny Apps stored in this directory
    site_dir /srv/shiny-server;

    # Log all Shiny output to files in this directory
    log_dir /var/log/shiny-server;

    # When a user visits the base URL rather than a particular application,
    # an index of the applications available in this directory will be shown.
    directory_index on;
  }
}
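
To deploy an app, drop its directory under site_dir; a sketch with a made-up app name:

sudo mkdir -p /srv/shiny-server/myapp
sudo cp app.R /srv/shiny-server/myapp/
# the app is then served at http://<server>:3838/myapp/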

Genomics in the cloud

https://aws.amazon.com/health/genomics/

https://docs.opendata.aws/genomics-workflows/

Miscellaneous

Making an Amazon EBS Volume Available for Use -> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html

# list the block devices to find the new volume
lsblk
# check whether the volume already contains a file system
sudo file -s /dev/xvdb
/dev/xvdb: data   # "data" means the volume is empty and unformatted

# ONLY RUN THE STEP BELOW IF YOU ARE MOUNTING AN EMPTY VOLUME
sudo mkfs -t ext4 /dev/xvdb

sudo mkdir /data
sudo mount /dev/xvdb /data

RAID

# stripe two ephemeral volumes into a RAID 0 array (run as root on Amazon Linux)
yum install -y mdadm
umount /media/ephemeral0
umount /media/ephemeral1
yes | mdadm --create --verbose --auto=yes /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
# create a file system on the array and mount it
mkfs.ext4 /dev/md0
mkdir -p /mnt
mount /dev/md0 /mnt
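
To have the array reassemble automatically after a reboot, one approach (path assumes Amazon Linux) is to record it in mdadm.conf:

mdadm --detail --scan >> /etc/mdadm.conf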

Tutorials