
6. Data Persistence


Academic year: 2022


UNIVERSITY OF SZEGED Department of Software Engineering UNIVERSITAS SCIENTIARUM SZEGEDIENSIS

6. Data Persistence

Vilmos Bilicki PhD University of Szeged

Department of Software Engineering

Program systems development


Types of Data

§ Data can be broadly classified into four types:

1. Structured Data:

§ Has a predefined model that organizes the data into a form that is relatively easy to store, process, retrieve and manage

§ E.g., relational data

2. Unstructured Data:

§ Opposite of structured data

§ E.g., Flat binary files containing text, video or audio

§ Note: such data is not completely devoid of structure (e.g., an audio file may still have an encoding structure and some metadata associated with it)


3. Dynamic Data:

§ Data that changes relatively frequently

§ E.g., office documents and transactional entries in a financial database

4. Static Data:

§ Opposite of dynamic data

§ E.g., Medical imaging data from MRI or CT scans


Why Classify Data?

§ Placing data into one of the following four quadrants can help in designing and developing a suitable storage solution

§ Relational databases are usually used for structured data

§ File systems or NoSQL databases can be used for (static), unstructured data (more on these later)

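Quadrants (Dynamic/Static versus Unstructured/Structured):

§ Dynamic, Unstructured: Media Production, eCAD, mCAD, Office Docs

§ Dynamic, Structured: Transaction Systems, ERP, CRM

§ Static, Unstructured: Media Archive, Broadcast, Medical Imaging

§ Static, Structured: BI, Data Warehousing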


History of the World, Part 1

§ Relational databases: the mainstay of business

§ Web-based applications caused spikes in load

■ Especially true for public-facing e-commerce sites

§ Developers began to front the RDBMS with memcached or to integrate other caching mechanisms within the application (e.g., Ehcache)


Scaling Up

§ Scaling up runs into problems when the dataset is simply too big

§ RDBMSs were not designed to be distributed

§ Developers began to look at multi-node database solutions

§ This is known as "scaling out" or "horizontal scaling"

§ Different approaches include:

■ Master-slave

■ Sharding


Scaling RDBMS – Master/Slave

§ Master-slave

■ All writes go to the master; all reads are performed against the replicated slave databases

■ Critical reads may be incorrect, as writes may not yet have been propagated to the slaves

■ Large data sets can pose problems, as the master needs to duplicate data to the slaves


Why Replicate Data?

§ Replicating data across servers helps in:

§ Avoiding performance bottlenecks

§ Avoiding single points of failure

§ And, hence, enhancing scalability and availability

[Figure: a main server and its replicated servers]


But, Consistency Becomes a Challenge

§ An example:

§ In an e-commerce application, the bank database has been replicated across two servers

§ Maintaining consistency of replicated data is a challenge

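[Figure: both replicas of the database start with Bal=1000. Event 1 = add $1000; Event 2 = add interest of 5%. One replica applies Event 1 and then Event 2 (Bal=2000, then Bal=2100); the other applies Event 2 and then Event 1 (Bal=1050, then Bal=2050). The two replicas end up with different balances.]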
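The ordering problem in the bank example can be reproduced in a few lines. This is only an illustrative sketch: the function names are made up for the example, and integer arithmetic is used for the 5% interest so that the balances match the slide exactly.

```python
def add_deposit(balance):
    # Event 1: add $1000
    return balance + 1000


def add_interest(balance):
    # Event 2: add interest of 5% (integer arithmetic, whole dollars)
    return balance * 105 // 100


def apply_events(balance, events):
    # Apply the events in the order in which this replica received them
    for event in events:
        balance = event(balance)
    return balance


# Both replicas start from Bal=1000 but receive the events in different orders
replica_1 = apply_events(1000, [add_deposit, add_interest])
replica_2 = apply_events(1000, [add_interest, add_deposit])
print(replica_1, replica_2)
```

The two replicas diverge because the two events do not commute, so a replication scheme has to agree on one order for them.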


The Two-Phase Commit Protocol

§ The two-phase commit protocol (2PC) can be used to ensure atomicity and consistency

[Figure: a coordinator and three participants (Participant 1 to 3), each running on its own database server (Database Server 1 to 3)]

Phase I: Voting

§ The coordinator sends VOTE_REQUEST to every participant

§ Each participant replies with VOTE_COMMIT

Phase II: Commit

§ The coordinator sends GLOBAL_COMMIT to every participant

§ Each participant performs a LOCAL_COMMIT on its database server

“Strict” consistency, which limits scalability!
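The message flow of the two phases can be sketched as a small in-process simulation. The slides only show the successful path (VOTE_REQUEST, VOTE_COMMIT, GLOBAL_COMMIT, LOCAL_COMMIT); the VOTE_ABORT and GLOBAL_ABORT messages, the class and the state names below are assumptions added for the failure case, and timeouts, logging and crash recovery are left out entirely.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # whether the local transaction can succeed
        self.state = "INIT"

    def on_vote_request(self):
        # Phase I: answer the coordinator's VOTE_REQUEST
        if self.can_commit:
            self.state = "READY"
            return "VOTE_COMMIT"
        self.state = "ABORT"
        return "VOTE_ABORT"  # assumed message name, not shown on the slides

    def on_decision(self, decision):
        # Phase II: apply the coordinator's global decision locally
        self.state = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"


def two_phase_commit(participants):
    # Phase I: the coordinator collects one vote from every participant
    votes = [p.on_vote_request() for p in participants]
    # Commit only if every single participant voted to commit
    if all(vote == "VOTE_COMMIT" for vote in votes):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"  # assumed message name
    # Phase II: broadcast the decision to all participants
    for p in participants:
        p.on_decision(decision)
    return decision


servers = [Participant("db1"), Participant("db2"), Participant("db3")]
print(two_phase_commit(servers))
```

Every participant has to answer before anything is committed, which is one way to see why this kind of strict consistency limits scalability.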


Scaling RDBMS - Sharding

§ Partitioning or sharding

■ Scales well for both reads and writes

■ Not transparent: the application needs to be partition-aware

■ Relationships/joins across partitions are no longer possible

■ Loss of referential integrity across shards
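What "partition-aware" means for the application can be illustrated with a minimal hash-based router. This is a sketch under assumptions: the slides do not prescribe a partitioning function, and the names shard_for and ShardedStore are hypothetical. A stable hash from hashlib is used because Python's built-in hash() of strings changes between runs.

```python
import hashlib


def shard_for(key, num_shards):
    # Map a key to a shard index with a hash that is stable across runs
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    def __init__(self, num_shards):
        # Each dict stands in for a separate database server
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(3)
for user_id in range(10):
    store.put(user_id, {"id": user_id})
print([len(shard) for shard in store.shards])
```

A query that needs rows from several shards has to be assembled by the application itself, which is where the loss of joins and referential integrity comes from.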


Why Shard Data?

§ Data is typically sharded (or striped) to allow for concurrent/parallel accesses

[Figure: the input data, a large file, is split into chunks that are spread across Machine 1, Machine 2 and Machine 3; Chunk1 is on Machine 1, Chunk3 on Machine 2 and Chunk5 on Machine 3, with the remaining chunks also distributed among the machines]

E.g., Chunks 1, 3 and 5 can be accessed in parallel


Amdahl’s Law

§ How much faster will a parallel program run?

§ Suppose that the sequential execution of a program takes T1 time units and the parallel execution on p processors/machines takes Tp time units

§ Suppose that, out of the entire execution of the program, a fraction s is not parallelizable while a fraction 1-s is parallelizable

§ Then the speedup (Amdahl’s formula) is:

$$\text{Speedup} = \frac{T_1}{T_p} = \frac{1}{s + \frac{1-s}{p}}$$


Amdahl’s Law: An Example

§ Suppose that:

§ 80% of your program can be parallelized

§ 4 machines are used to run your parallel version of the program

§ The speedup you can get according to Amdahl’s law is:

$$\text{Speedup} = \frac{1}{0.2 + \frac{0.8}{4}} = \frac{1}{0.4} = 2.5$$

Although you use 4 processors you cannot get a speedup more than 2.5 times!
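The calculation is easy to repeat for other values. The helper below is a sketch that assumes the usual form of Amdahl's formula, speedup = 1 / (s + (1 - s) / p), with s the non-parallelizable fraction and p the number of processors; the function name is made up for the example.

```python
def amdahl_speedup(serial_fraction, processors):
    # serial_fraction is s, the share of the run that cannot be parallelized
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


# The example from the slide: 80% parallelizable, 4 machines
print(round(amdahl_speedup(0.2, 4), 3))

# Adding machines helps less and less: the speedup stays below 1/s = 5
for p in (4, 16, 64, 1024):
    print(p, round(amdahl_speedup(0.2, p), 3))
```

With s = 0.2 the speedup can never reach 5, no matter how many machines are added.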


Ideal vs. Actual Cases

§ Amdahl’s argument is oversimplified

§ In reality, communication overhead and potential workload imbalance exist when running parallel programs

[Figure 1. Parallel speed-up, an ideal case: the serial run consists of 20 time units that cannot be parallelized and 80 that can be; in the parallel run the 80 units are split evenly across Process 1 to Process 4 (20 each).]

[Figure 2. Parallel speed-up, an actual case: the same program, but load imbalance across the processes and communication overhead lengthen the parallel part.]


Other ways to scale RDBMS

§ Multi-master replication

§ INSERT only, no UPDATEs/DELETEs

§ No JOINs, thereby reducing query time

■ This involves denormalizing data

§ In-memory databases


What is NoSQL?

§ Stands for Not Only SQL

§ A class of non-relational data storage systems

§ These systems usually do not require a fixed table schema, nor do they use the concept of joins

§ All NoSQL offerings relax one or more of the ACID properties (see the CAP theorem)


Why NoSQL?

§ For data storage, an RDBMS cannot be the be-all and end-all

§ Just as there are different programming languages, we need other data storage tools in the toolbox

§ A NoSQL solution is more acceptable to a client now than even a year ago

■ Think about proposing a Ruby/Rails or Groovy/Grails solution now versus a couple of years ago


Dynamo and BigTable

§ Three major papers were the seeds of the NoSQL movement:

■ BigTable (Google)

■ Dynamo (Amazon): gossip protocol (discovery and error detection), distributed key-value data store, eventual consistency

■ CAP theorem (discussed next)


CAP Theorem

§ Three properties of a system: consistency, availability and partition tolerance

§ You can have at most two of these three properties for any shared-data system

§ To scale out, you have to partition. That leaves a choice between consistency and availability

■ In almost all cases, you would choose availability over consistency


Availability

§ Traditionally thought of as the server/process being available "five 9s" of the time (99.999%)

§ However, for a system with a large number of nodes, at almost any point in time there is a good chance that a node is down or that there is a network disruption among the nodes

■ We want a system that is resilient in the face of network disruption
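A rough calculation shows why node count matters. The sketch below assumes that nodes fail independently and that each node has the same availability, which is a simplification that real clusters do not satisfy; the function name and the sample numbers are illustrative only.

```python
def prob_all_nodes_up(node_availability, num_nodes):
    # Probability that every node is up at a given instant,
    # ASSUMING independent failures and identical per-node availability
    return node_availability ** num_nodes


# Even with "five 9s" per node, large clusters are rarely fully healthy
for n in (1, 100, 10_000, 100_000):
    p = prob_all_nodes_up(0.99999, n)
    print(f"{n:>7} nodes: P(all up) = {p:.4f}")
```

Under these assumptions the chance that all nodes are up falls off exponentially with the number of nodes, so some failure has to be treated as the normal case.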


Consistency Model

§ A consistency model determines the rules for the visibility and apparent order of updates

§ For example:

■ Row X is replicated on nodes M and N

■ Client A writes row X to node N

■ Some period of time t elapses

■ Client B reads row X from node M

■ Does client B see the write from client A?

■ Consistency is a continuum with tradeoffs

■ For NoSQL, the answer would be: maybe

§ The CAP theorem states that strict consistency cannot be achieved at the same time as availability and partition tolerance
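The "maybe" can be made concrete with a toy model of row X on nodes M and N. This is a sketch, not how any particular database replicates data: the class name is hypothetical, and replication is modelled as an explicit propagate() step that stands for the time t elapsing.

```python
class ReplicatedRow:
    def __init__(self, initial):
        # Row X is replicated on nodes M and N
        self.nodes = {"M": initial, "N": initial}
        self.pending = []  # writes accepted by one node, not yet copied to the others

    def write(self, node, value):
        self.nodes[node] = value
        self.pending.append((node, value))

    def read(self, node):
        return self.nodes[node]

    def propagate(self):
        # Copy every pending write to the other replicas, in arrival order
        for source, value in self.pending:
            for name in self.nodes:
                if name != source:
                    self.nodes[name] = value
        self.pending.clear()


row_x = ReplicatedRow("old")
row_x.write("N", "new")    # client A writes row X to node N
print(row_x.read("M"))     # client B reads from node M before propagation
row_x.propagate()
print(row_x.read("M"))     # client B reads from node M after propagation
```

Whether client B sees the write depends on whether propagation has happened before the read, which is exactly the question a consistency model has to answer.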


The CAP Theorem

§ The limitations of distributed databases can be described by the so-called CAP theorem

§ Consistency: every node always sees the same data at any given instant (i.e., strict consistency)

§ Availability: the system continues to operate, even if nodes in a cluster crash, or some hardware or software parts are down due to upgrades

§ Partition Tolerance: the system continues to operate in the presence of network partitions

CAP theorem: any distributed database with shared data can have at most two of the three desirable properties C, A and P


Large-Scale Databases

§ When companies such as Google and Amazon were designing large-scale databases, 24/7 Availability was a key requirement

§ A few minutes of downtime means lost revenue

§ When horizontally scaling databases to 1000s of machines, the likelihood of a node or network failure increases tremendously

§ Therefore, in order to have strong guarantees on Availability and Partition Tolerance, they had to sacrifice “strict” Consistency (as implied by the CAP theorem)


Trading-Off Consistency

§ Maintaining consistency means balancing the strictness of consistency against availability and scalability

§ Good-enough consistency depends on your application


§ Strict consistency: generally hard to implement, and inefficient

§ Loose consistency: easier to implement, and efficient


The BASE Properties

§ The CAP theorem states that it is impossible to guarantee both strict Consistency and Availability while also tolerating network partitions

§ This resulted in databases with relaxed ACID guarantees

§ In particular, such databases apply the BASE properties:

§ Basically Available: the system guarantees Availability

§ Soft-State: the state of the system may change over time

§ Eventual Consistency: the system will eventually become consistent


Eventual Consistency

§ A database is termed Eventually Consistent if:

§ All replicas gradually become consistent in the absence of updates

[Figure: an update to Webpage-A is applied at one replica and gradually propagates to all other replicas of Webpage-A]


Eventual Consistency: A Main Challenge

§ But, what if the client accesses the data from different replicas?

[Figure: Webpage-A is updated while its replicas are still converging; a client that reads from different replicas may see different versions of Webpage-A]

Protocols like Read Your Own Writes (RYOW) can be applied!
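One way to get read-your-own-writes behaviour is to let each client session remember the version of its last write and only accept reads from replicas that have caught up to it. The sketch below is one possible scheme under that assumption, not a description of a specific product; the Replica and Session classes and the integer version counter are hypothetical.

```python
class Replica:
    def __init__(self):
        self.version = 0   # version of the latest update applied here
        self.value = None


class Session:
    def __init__(self, replicas):
        self.replicas = replicas
        self.last_written_version = 0  # newest version this client has written

    def write(self, replica_index, value):
        replica = self.replicas[replica_index]
        replica.version = max(r.version for r in self.replicas) + 1
        replica.value = value
        self.last_written_version = replica.version

    def read(self, preferred_index):
        # Try the preferred replica first, then fall back to the others
        others = [i for i in range(len(self.replicas)) if i != preferred_index]
        for i in [preferred_index] + others:
            replica = self.replicas[i]
            # Only accept a replica that already contains our own last write
            if replica.version >= self.last_written_version:
                return replica.value
        raise RuntimeError("no replica has caught up with this session's writes")


replicas = [Replica(), Replica()]
session = Session(replicas)
session.write(0, "Webpage-A v2")
# Replica 1 has not received the update yet, so the read falls back to replica 0
print(session.read(1))
```

Other clients may still read the stale copy from replica 1 until the update propagates; the guarantee only covers the session that made the write.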


Eventual Consistency

§ When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent

§ For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service

§ Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID


NoSQL Databases

§ To this end, a new class of databases emerged, which mainly follow the BASE properties

§ These were dubbed NoSQL databases

§ E.g., Amazon’s Dynamo and Google’s Bigtable

§ Main characteristics of NoSQL databases include:

§ No strict schema requirements

§ No strict adherence to ACID properties

§ Consistency is traded in favor of Availability


Types of NoSQL Databases

§ Here is a limited taxonomy of NoSQL databases:

§ NoSQL databases:

■ Document stores

■ Graph databases

■ Key-value stores

■ Columnar databases


Document Stores

§ Documents are stored in some standard format or encoding (e.g., XML, JSON, PDF or Office documents)

§ These are typically referred to as Binary Large Objects (BLOBs)

§ Documents can be indexed

§ This allows document stores to outperform traditional file systems

§ E.g., MongoDB and CouchDB (both can be queried using MapReduce)


Graph Databases

§ Data are represented as vertices and edges

§ Graph databases are powerful for graph-like queries (e.g., find the shortest path between two elements)

§ E.g., Neo4j and VertexDB

[Figure: an example graph with three vertices: (Id: 1, Name: Alice, Age: 18), (Id: 2, Name: Bob, Age: 22) and (Id: 3, Name: Chess, Type: Group)]


Social Network “path exists” Performance

§ Experiment:

• ~1k persons

• Average of 50 friends per person

• pathExists(a,b) limited to depth 4

• Caches warmed up to eliminate disk I/O

Database              # persons    Query time
Relational database   1000         2000 ms
Neo4j                 1000         2 ms
Neo4j                 1000000      2 ms
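The pathExists(a,b) query from the experiment can be sketched as a depth-limited breadth-first search over an adjacency list. This only illustrates what the query computes; it is not Neo4j's implementation, and the friends dictionary and the function name are assumptions made for the example.

```python
from collections import deque


def path_exists(friends, a, b, max_depth=4):
    # friends maps each person to an iterable of that person's friends
    if a == b:
        return True
    visited = {a}
    queue = deque([(a, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for friend in friends.get(person, ()):
            if friend == b:
                return True
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, depth + 1))
    return False


# A simple chain of friendships: 1 - 2 - 3 - 4 - 5 - 6
friends = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
print(path_exists(friends, 1, 5))  # 4 hops away, within the limit
print(path_exists(friends, 1, 6))  # 5 hops away, beyond the limit
```

The search only touches the neighbourhood of the start vertex, so its cost depends on the local friend counts rather than on the total number of persons stored.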


Key-Value Stores

§ Keys are mapped to (possibly) more complex values (e.g., lists)

§ Keys can be stored in a hash table and can be distributed easily

§ Such stores typically support regular CRUD (create, read, update, and delete) operations

§ That is, no joins and aggregate functions

§ E.g., Amazon DynamoDB and Apache Cassandra


Columnar Databases

§ Columnar databases are a hybrid of RDBMSs and Key-Value stores

§ Values are stored in groups of zero or more columns, but in Column-Order (as opposed to Row-Order)

§ Values are queried by matching keys

§ E.g., HBase and Vertica

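[Figure: three records, (Alice, 3, 25), (Bob, 4, 19) and (Carol, 0, 45), with columns A, B and C, shown in three storage layouts:]

§ Row-Order: Alice 3 25 | Bob 4 19 | Carol 0 45 (each record is stored contiguously; Record 1 is Alice 3 25)

§ Columnar (or Column-Order): Alice Bob Carol | 3 4 0 | 25 19 45 (each column is stored contiguously; Column A is Alice Bob Carol)

§ Columnar with Locality Groups: Alice Bob Carol | 3 25 4 19 0 45 (Column A = Group A; Column Family {B, C} is stored together)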
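The three layouts differ only in the order in which the same values are written out. The sketch below flattens the example records in each order; the function names are made up, and a real columnar store adds compression, indexing and on-disk formats that are not modelled here.

```python
records = [("Alice", 3, 25), ("Bob", 4, 19), ("Carol", 0, 45)]


def row_order(records):
    # Store each record contiguously, one record after another
    return [value for record in records for value in record]


def column_order(records):
    # Store each column contiguously, one column after another
    return [value for column in zip(*records) for value in column]


def locality_groups(records, groups):
    # groups lists tuples of column indexes that are stored together,
    # e.g. [(0,), (1, 2)] for Group A = {A} and Column Family {B, C}
    layout = []
    for group in groups:
        for record in records:
            layout.extend(record[i] for i in group)
    return layout


print(row_order(records))
print(column_order(records))
print(locality_groups(records, [(0,), (1, 2)]))
```

A query that reads a single column touches one contiguous run in the column-order layout, but has to skip through every record in the row-order layout.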


What am I giving up?

§ Joins

§ GROUP BY

§ ORDER BY

§ ACID transactions

§ SQL as a sometimes frustrating but still powerful query language

§ Easy integration with other applications that support SQL


Searching

§ Relational

■ SELECT `column` FROM `database`.`table` WHERE `id` = key;

■ SELECT product_name FROM rockets WHERE id = 123;

§ Cassandra (standard)

■ keyspace.getSlice(key, "column_family", "column")

■ keyspace.getSlice(123, new ColumnParent("rockets"), getSlicePredicate());


Typical NoSQL API

§ Basic API access:

■ get(key) -- Extract the value given a key

■ put(key, value) -- Create or update the value given its key

■ delete(key) -- Remove the key and its associated value

■ execute(key, operation, parameters) -- Invoke an operation on the value (given its key), which is a special data structure (e.g., List, Set, Map, etc.)
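The four calls can be mirrored by a small in-memory class. This is a sketch of the shape of such an API, not the interface of any particular NoSQL product; the class name is hypothetical, and execute() is modelled here by calling a method of the stored Python object by name.

```python
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def get(self, key):
        # Extract the value given a key (None if the key is absent)
        return self._data.get(key)

    def put(self, key, value):
        # Create or update the value given its key
        self._data[key] = value

    def delete(self, key):
        # Remove the key and its associated value, if present
        self._data.pop(key, None)

    def execute(self, key, operation, *parameters):
        # Invoke a named operation on the stored data structure
        value = self._data[key]
        return getattr(value, operation)(*parameters)


store = KeyValueStore()
store.put("cart:42", [])
store.execute("cart:42", "append", "rocket skates")
print(store.get("cart:42"))
store.delete("cart:42")
print(store.get("cart:42"))
```

Because every call addresses exactly one key, such an interface is easy to distribute across machines, at the cost of having no joins or aggregate queries.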


Domain Model

§ Design your domain model first

§ Create your Cassandra data store to fit your domain model

<Keyspace Name="Acme">
    <ColumnFamily CompareWith="UTF8Type" Name="Rockets" />
    <ColumnFamily CompareWith="UTF8Type" Name="OtherProducts" />
    <ColumnFamily CompareWith="UTF8Type" Name="Explosives" />
</Keyspace>


Data Model

ColumnFamily: Rockets

Key   Column name    Value
1     name           Rocket-Powered Roller Skates
1     toon           Ready, Set, Zoom
1     inventoryQty   5
1     brakes         false
2     name           Little Giant Do-It-Yourself Rocket-Sled Kit
2     toon           Beep Prepared
2     inventoryQty   4
2     brakes         false
3     name           Acme Jet Propelled Unicycle
3     toon           Hot Rod and Reel
3     inventoryQty   1
3     wheels         1
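The Rockets column family can be pictured as a map from row key to a map of column names and values, where rows need not share the same columns. The dictionary below is only a mental model built from the values on the slide; the Python types (integers and booleans) and the get_column helper are assumptions, and this is not Cassandra client code.

```python
# Row key -> {column name -> value}; note that row 3 has "wheels" but no "brakes"
rockets = {
    1: {"name": "Rocket-Powered Roller Skates", "toon": "Ready, Set, Zoom",
        "inventoryQty": 5, "brakes": False},
    2: {"name": "Little Giant Do-It-Yourself Rocket-Sled Kit", "toon": "Beep Prepared",
        "inventoryQty": 4, "brakes": False},
    3: {"name": "Acme Jet Propelled Unicycle", "toon": "Hot Rod and Reel",
        "inventoryQty": 1, "wheels": 1},
}


def get_column(column_family, key, column):
    # Look up one column of one row; None if the row or the column is absent
    return column_family.get(key, {}).get(column)


print(get_column(rockets, 1, "name"))
print(get_column(rockets, 3, "wheels"))
print(get_column(rockets, 1, "wheels"))
```

The last lookup returns None because row 1 simply has no such column, which is how the lack of a fixed schema shows up in practice.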


Uses for Hadoop

§ Data-intensive text processing

§ Assembly of large genomes

§ Graph mining

§ Machine learning and data mining

§ Large-scale social network analysis


Who Uses Hadoop?
