• Nem Talált Eredményt

Indexing Techniques

N/A
N/A
Protected

Academic year: 2023

Ossza meg "Indexing Techniques"

Copied!
48
0
0

Teljes szövegt

(1)

Indexing Techniques

(2)

The Problem

• What can we introduce to make search more efficient?

Indices!

• What is an index?

… …

(3)

Definitions

• Index: an auxiliary data structure to speed up record retrieval

Search key: the field/s of a table which is/are indexed

• Storage: index files that contain index records

– Each entry storing

• Actual data record

• or, search key value k and record ID <k,rid>

• or, search key value k and list of records IDs <k,rid list>

• Types: ordered and unordered (hash) indices

Paul Anna

Tim

(4)

Types of Ordered Indices (1/3)

• Assuming ordered data files

• Depending on which field is indexed

– Primary index: search key is ordering key field

• Pointer for each page

– Secondary index: search key is non ordering field

Paul 00112233

Anna 00112234

Matt 00112235

Tim 00112236

Carol 00112237

00112233 00112235 00112236 00112238

Anna Carol Paul

Tim

primary

secondary

(5)

Types of Ordered Indices (2/3)

• Depending on the density of index records

– Dense index: an index record for each distinct search key value, ie every record

– Sparse index: index records for only some search key values

• search key value for first record in page

• pointer to page

Paul 00112233

Anna 00112234

Matt 00112235

Tim 00112236

Carol 00112237

00112233 00112235 00112236 00112238

sparse

00112233 00112234 00112235 00112236 00112237 00112238

dense

(6)

Types of Ordered Indices (3/3)

• Ordering field is nonkey (may have duplicates)

– Clustered index – Unclustered index

Paul 00112233

Anna 00112234

Matt 00112235

Tim 00112236

Carol 00112237

Rob 00112238

Paul 01112233

Tim 01112236

Tim 02112236

Anna Carol Matt Paul Rob

Tim 00112233

00112234 00112235 00112236 00112237 00112238 01112233 01112236 02112236

clustered

unclustered

(7)

Indices Exercise

• 215 records

• 128 bytes/record

• 210 bytes/page

• ordered file equality search on ordering field, unspanned organization

– without an index – with a primary index

• on field of size 12 bytes

• assume pointer 4 bytes long

(8)

Multi-level Indices (1/2)

• If access using first-level index is still expensive

• Build a sparse index on the first-level index

– Multi-level Index

• Fan-out: index blocking factor

Paul 00112233

Anna 00112234

Matt 00112235

Tim 00112236

Carol 00112237

00112233 00112234 00112235 00112236 00112237 00112238

00112233 00112235 00112236

first-level index

second-level index

(9)

Multi-level Indices (2/2)

• 26 index records/page (fan-out)

• 215 index records

• 1st-level

– 29 pages

• 2nd-level

– 29 index records – 23 pages

• 3rd-level

– 23 index records – 1 page

• 1 <= 215 / (26)t

• t = ceil(log26 215 ) = 3

• t = ceil(logfo#index-records)

(10)

Dynamic multi-level indices

• So far assumed indices are physically ordered files

– expensive insertions and deletions

• Dynamic multi-level indices

– B trees – B+ trees

(11)

Tree-structured Indices

• For each node: K1 < K2 < … Kq-1

• For each value X in subtree pointed to by Pi

– Ki-1< X < Ki, 1<i<q – X < Ki, i=1

– Ki-1< X, i=q

P1 K1 … Ki-1 Pi Ki … Kq-1 Pq

X X X

(12)

B tree

• Problems: empty nodes, unbalanced trees

– solution: B trees

(13)

B tree: Definition

• Each node: <P1,<K1, Pr1>, P2,…,<Kq-1, Prq-1>, Pq>

• Pi tree pointer, Ki search value, Pri data pointer

• For each node: K1 < K2 < … Kq-1

• For each value X in subtree pointed to by Pi

– Ki-1< X < Ki, 1<i<q – X < Ki, i=1

– Ki-1< X, i=q

• Each node at most q pointers

– B tree is order q

• Each node at least ceil(q/2) tree pointers

– except from root

• Internal node with p pointers has p-1 values

• All leaves at the same level

– balanced tree

(14)

B tree: Example

5 8

ø 1 ø 3 ø ø 6 ø 7 ø ø 9 ø 12 ø

tree pointer data pointer

ø

null pointer

(15)

B

+

tree

• Most implementations of B tree are B+ tree

• Data pointers only in leaves

– more entries in internal nodes than regular B trees – less internal nodes

– less levels – faster access

(16)

B

+

tree: Definition

• Internal nodes: <P1,K1, P2,…, Pq-1, Kq-1, Pq>

• Leaf nodes: <<K1, Pr1>, <K2, Pr2>,…,<Kp-1, Prp-1>, Pnext>

• Pri points a data records or block of pointers of such records

• leaf order

120 150 180

150 156 179 180 200

100 101 110 120 130

(17)

100 101 110 120 130 150 156 179 180 200

3 5 11 30 35

120 150 180 30

100

B+ tree: Search

• At each level, find smallest Ki larger than search key

• Follow associated pointer Pi

(18)

B+ tree: Insert

• Nodes may overflow or underflow

• Ignoring overflow or underflow

• Inserting data record with with search key value k

– find leaf node – if k found

• add record to file, create indirect block if there isn’t one

• add record pointer to indirect block

– if k not found

• add data record to file

• insert record pointer in leaf node (all search keys in order)

(19)

B+ tree: Delete

• Ignoring overflow or underflow

• Find leaf node with search key value k

• Find data record pointer, delete record

• delete index record

– and indirect block, if any, if empty

(20)

B+ tree: Simple Insert

• Insert 42

100 101 110 120 130 150 156 179 180 200

3 5 11 30 35

120 150 180 30

k < 100 100

42

(21)

B+ tree: Leaf Overflow (1/2)

• Insert 9

100 101 110 120 130 150 156 179 180 200

3 5 11 30 35 42

120 150 180 30

k < 100 100

(22)

B+ tree: Leaf Overflow (2/2)

• first ceil(n/2) in existing node, rest in new leaf node

• n=3+1=4

100 101 110 120 130 150 156 179 180 200 120 150 180

9 30

k < 100 100

3 5 9 11 30 35 42

(23)

9 30

k < 100

3 5 9 11 30 35 42

B+ tree: Internal Node Overflow (1/3)

• Insert 210, insert 205

100 101 110 120 130 150 156 179 180 200 210 120 150 180

100

(24)

B+ tree: Internal Node Overflow (2/3)

• Leaf Split

9 30

k < 100

3 5 9 11 30 35 42

100 101 110 120 130 150 156 179 180 200 120 150 180

100

205 210

(25)

B+ tree: Internal Node Overflow (3/3)

9 30

k < 100

3 5 9 11 30 35 42

100 101 110 120 130 150 156 179 180 200 120

100 150

205 210 180 205

(26)

B+ tree: New Root (1/2)

• Insert 210, insert 205

100 101 110 120 130 150 156 179 180 200 120 150 180

205 210

(27)

B+ tree: New Root (2/2)

180 205

100 101 110 120 130 150 156 179 180 200 120

205 210 150

(28)

Index Insert Exercise

• Insert 8, 7, 41

9 30

3 5 9 11 30 35 42

(29)

B+ tree: Delete

• Simple delete case

• Underflow case:

– redistribute records – coalesce with siblings – update parents

(30)

B+ tree: Simple Delete (1/2)

• Delete 110

180 205

100 101 110 120 130 150 156 179 180 200 120

205 210 215 150

(31)

B+ tree: Simple Delete (2/2)

• Leaf Updated

180 205

100 101 120 130 150 156 179 180 200

120

205 210 215 150

(32)

B+ tree: Delete Redistribution (1/2)

• Delete 180

180 205

100 101 120 130 150 156 179 180 200

120

205 210 215 150

(33)

B+ tree: Delete Redistribution (2/2)

• Redistribute entries

– left or right sibling

179 205

100 101 120 130 150 156 179 200

120

205 210 150

(34)

B+ tree: Delete Coalesce (1/4)

• Delete 101

179 205

100 101 120 130 150 156 179 200

120

205 210 215 150

(35)

B+ tree: Delete Coalesce (2/4)

• Leaf updated

• No redistribution

– sibling coalesce

179 205

100 120 130 150 156 179 200

120

205 210 215 150

(36)

B+ tree: Delete Coalesce (3/4)

• Leaf updated

• No redistribution

– sibling coalesce

179 205

100 120 130 150 156 179 200

205 210 215 150

(37)

B+ tree: Delete Coalesce (4/4)

• Redistribution

205

100 120 130 150 156 179 200

150

205 210 215 179

(38)

Hashing Techniques

(39)

Static Hashing (1/2)

• Store records in buckets with overflow chains

• Allocate a fixed number of buckets M

• Problems:

– small M

• long overflow chains, slow search-delete-insert

null

h

null

(40)

Static Hashing (2/2)

• Problems:

– large M

• wasted space, slow scan null

h

null

null

(41)

Dynamic Hashing

• Splitting and coalescing buckets as the database grows-shrinks

• One scheme: Extendible Hashing

• Hash function generates large values, eg 32 bits

– use i bits, change i as database size changes

• If overflow, double the number of buckets

– use i+1 bits of the hash function

– but, expensive: read all pages M and distribute records in 2*M pages

• solution: use a directory and double the size of the directory

– only split bucket that overflowed

(42)

Extendible Hashing (1/4)

h(18) = 10010

2

01 00

11 10

16 20 2

1 2

2

Directory

Buckets 3 7 2

A

B

C

D

18

(43)

Extendible Hashing (2/4)

h(4) = 00100

2

01 00

11 10

16 20 2

1 2

2

3 7 2

A

B

C

D

18

(44)

Extendible Hashing (3/4)

2

01 00

11 10

16 3

1 2

2

3 7 2

A

B

C

D

18

3

(45)

Extendible Hashing (4/4)

3

001 000

011 010

16 3

1 2

2

3 7 2

A

B

C

D

18

3

101 100

111 110

• Global Depth

• Local Depth

• If bucket full:

– split bucket – increment LD

• If GD=LD

– increment GD – double directory

(46)

Extendible Hashing: Delete

• If deletion make bucket empty

– merge with split image

• If directory pointers point to same bucket as split image

– directory halved

(47)

Extendible Hashing: Summary

• Avoids overflow pages

• Directory can get large

• Key search requires just 2 page reads

• Space utilization fluctuates

– 59-90% for uniformly distributed records

(48)

Extendible Hashing: Exercise

• Initially GD = LD = 1

M = 2 buckets

• Hash function: h(k) = k mod 2i

• inserts: 14, 18, 22, 3, 9

• deletes 9, 22, 3

1

01 00

12 8 1

5

1

Hivatkozások

KAPCSOLÓDÓ DOKUMENTUMOK

Working over an infinite field of positive characteristic, an upper bound is given for the nilpotency index of a finitely generated nil algebra of bounded nil index n in terms of

1) In the Czech Republic, there is an established system of elementary art schools, which are focused on music field, art field, literary-dramatic field and dance field. Pupils

DE C7240 Information analysis and indexing; C7250R Information retrieval techniques; C6170 Expert systems and other Al software and techniques. PY 2004 PD 2004

INGWERSEN, P.-WORMELL, I.: Modem Indexing and retrieval techniques matching different types of Information needs.. Proceedings of

Within the scope of the statistical business process and data access services, data protection is a key element, which means the protection of the individual data of data

data management plans: a knowledge retrieval approach to automate the assessment of funders’ requirements. Advances in Information Retrieval. Machine-actionable data management plans

The magnetic field at the position of the falling bead is the sum of the external field and the field due to induced magnetic momenta, which are represented by

The aim of this section is to find an optimal speed ˙ ξ 0 , which guarantees the minimization of control force (fuel consumption) and the di ff erence of the speed proposed by