
Dec 24, 2009

Aligning an SSD on Linux

I’ve got a small home server with a software RAID-5 for storing my files. It also runs a few virtual machines and acts as a NAT router for internet access. Nothing expensive, just some Frankensteinian patchwork built from old hardware left over when I upgraded my workstation. Nevertheless, I granted it a brand new Intel X25-M SSD last week.

Photo of an Intel X25-M SSD drive, which is a metal box smaller than a CD case

Did I mention that this server is running Gentoo Linux? I thought this would be a good time to do a fresh install and get everything right that might have gone wrong the first time. Besides, installing Linux is always an interesting (and masochistic) experience, especially when your chosen distribution has no installer :)

Because getting my partitions and file systems aligned also proved to be a difficult task, I thought why not make a small article out of this!

Erase Block Size

SSDs always erase entire blocks of memory at once. Before a flash memory cell can be written, it needs to be erased, which requires applying a large voltage to the memory cells, and that can only be done to an entire block of cells at once (probably because this kind of power would affect other cells around the one being erased; at least that’s my guess.)

Anyway, this means that if you write 1 KB of data to an SSD with an erase block size of 128 KB, the SSD needs to read 127 KB from the target block, erase the block and write the old data plus the new data back into the block. That’s something one just has to accept when using an SSD. Modern SSD firmware will do its best to pre-erase blocks when it’s idle and try to write new data into these pre-erased blocks (by mapping data to other locations on the drive without the knowledge of the OS.)
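To put a number on that overhead, here is a minimal sketch of the arithmetic. The function name rmw_overhead is made up for this example, and it assumes the naive case where the write lands in a single block that is already in use:

```shell
# Hypothetical helper: bytes the drive has to read back and rewrite
# when a small write lands inside one erase block that is already in use.
rmw_overhead() {
    rmw_erase_block=$1   # erase block size in bytes (assumed value)
    rmw_write_size=$2    # size of the write in bytes, at most one erase block
    echo $(( rmw_erase_block - rmw_write_size ))
}

# 1 KiB written into a 128 KiB erase block: 130048 bytes (127 KiB) of old data
rmw_overhead $(( 128 * 1024 )) 1024
```

Pre-erased blocks and remapping in the firmware can hide much of this in practice, so treat the figure as a worst case rather than a measurement.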

Still, watch what happens if a file system just sees the SSD as a brick of memory and writes data at a random position:

A box of cells with a small section highlighted that goes across a cell border

The SSD now has to erase and write two blocks, even though one would have sufficed for the amount of data being written. To fix this, the drive’s firmware would have to do data mapping on the byte level, which likely isn’t going to happen (in the worst case, you would need more memory for the remapping table than the drive’s capacity!)

If the file system’s write was aligned to a multiple of the SSD’s erase block size, the result would be this:

A box of cells with a small section highlighted that stays inside a single cell
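The two pictures can also be expressed as a small calculation. This is only a sketch: blocks_touched is a made-up name, and the 128 KiB erase block size is an assumption that depends on the drive:

```shell
# Hypothetical helper: number of erase blocks a write touches.
# Arguments: byte offset of the write, length in bytes, erase block size in bytes.
blocks_touched() {
    bt_first=$(( $1 / $3 ))              # index of the block holding the first byte
    bt_last=$(( ($1 + $2 - 1) / $3 ))    # index of the block holding the last byte
    echo $(( bt_last - bt_first + 1 ))
}

EB=$(( 128 * 1024 ))                     # assumed erase block size

blocks_touched $(( EB - 2048 )) 4096 "$EB"   # 4 KiB write across a block border: 2
blocks_touched "$EB" 4096 "$EB"              # same write, aligned to a block: 1
```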

Thus, it’s generally a good idea to make sure your file system’s writes are aligned to multiples of your SSD’s erase block size. As I found out, this isn’t quite as easy as it sounds. The first road block is already encountered when you partition a hard drive:

Partition Alignment

If the partitions of a hard drive aren’t aligned to begin at multiples of 128 KiB, 256 KiB or 512 KiB (depending on the SSD used), aligning the file system is useless because everything is skewed by the start offset of the partition. Thus, the first thing you have to take care of is aligning the partitions you create.
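Whether a given partition start is aligned can be checked with a bit of shell arithmetic. The sketch below assumes 512-byte sectors and takes the start sector as fdisk reports it in sector units; is_aligned is a name invented for this example:

```shell
# Hypothetical helper: is a partition's start sector on an erase block boundary?
# Arguments: start sector (512-byte sectors assumed), erase block size in KiB.
is_aligned() {
    ia_start_bytes=$(( $1 * 512 ))
    if [ $(( ia_start_bytes % ($2 * 1024) )) -eq 0 ]; then
        echo aligned
    else
        echo misaligned
    fi
}

is_aligned 63 128      # traditional DOS-style start at sector 63: misaligned
is_aligned 1024 512    # start at sector 1024 (512 KiB into the drive): aligned
```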

A spindle with three discs with a red ring superimposed on each of the discs
A cylinder.

A spindle with three discs with a red pie slice superimposed on each of the discs
A sector.

Traditionally, hard drives were addressed by indicating the cylinder, head and sector at which data was to be read or written. These represented the radial position, the drive head (= platter and side) and the angular position of the data respectively. With LBA (logical block addressing), this is no longer the case. Instead, the entire hard drive is addressed as one continuous stream of data.

Linux’ fdisk, however, still uses a virtual C-H-S system where you can define any number of heads and sectors yourself (the cylinders are calculated automatically from the drive’s capacity), with partitions always starting and ending at intervals of heads x sectors. Thus, you need to choose a number of heads and sectors whose product, the cylinder size, is a multiple of the SSD’s erase block size.

I found two posts which detail this process: Aligning Filesystems to an SSD’s Erase Block Size and Partition alignment for OCZ Vertex in Linux. The first one recommends 224 heads and 56 sectors, but I can’t quite understand where those numbers come from, so I used the advice from the post on the OCZ forums with 32 heads and 32 sectors, which means fdisk uses a cylinder size of 1024 sectors. And because each sector holds 512 bytes, fdisk’s unit size (= 512 bytes x heads x sectors) is 512 KiB, which happens to be an SSD’s maximum erase block size. Nice!
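The cylinder sizes behind those two recommendations can be worked out like this. It is just a sketch assuming 512-byte sectors; the function names are invented for the example:

```shell
# Bytes per fdisk "cylinder" for a virtual geometry, assuming 512-byte sectors.
# Arguments: heads, sectors per track.
cylinder_bytes() {
    echo $(( $1 * $2 * 512 ))
}

# Prints "yes" if every cylinder boundary falls on an erase block boundary.
# Arguments: heads, sectors per track, erase block size in KiB.
cylinder_aligned() {
    ca_bytes=$(cylinder_bytes "$1" "$2")
    if [ $(( ca_bytes % ($3 * 1024) )) -eq 0 ]; then echo yes; else echo no; fi
}

cylinder_bytes 32 32           # 524288 bytes = 512 KiB
cylinder_aligned 32 32 512     # yes
cylinder_aligned 224 56 128    # yes (one cylinder is 49 x 128 KiB)
cylinder_aligned 224 56 512    # no
cylinder_aligned 255 63 128    # no (the traditional 255/63 geometry)
```

So 224 heads and 56 sectors line up with 128 KiB erase blocks but not with 512 KiB ones, while 32/32 covers both.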

To make fdisk use 32 heads and 32 sectors, remove all partitions from a hard drive and then launch fdisk with the following command line when you create the first partition:

fdisk -S 32 -H 32 /dev/sda

The OCZ post also recommends starting at the second cylinder (the second 512 KiB unit) because the first partition is otherwise shifted by one track. Don’t ask me why :)

Here’s how I partitioned my SSD in the end:

Screenshot of a linux console where fdisk reports 32 heads and 32 sectors

For a normal hard drive, I’d probably use 128 heads and 32 sectors now to achieve 4 KiB boundaries for my partitions.

RAID Chunk Size

If you plan on running a software RAID array, I’ve seen chunk sizes of 64 KiB and 128 KiB being recommended. This can be specified using the --chunk parameter for mdadm, e.g.

mdadm --create /dev/md3 --level=1 --chunk=128 --raid-devices=2 /dev/sda3 /dev/sdb3

Probably the larger chunk size is more useful if you are storing large files on the RAID partition, but I haven’t found any advice which included benchmarks or at least a solid explanation yet.

File System Alignment

Now that the partitions have been taken care of, the file systems need to use proper alignment as well. Generally all file systems use some kind of allocation blocks, usually with a size of 4 KiB. But increasing this size to 128 KiB (or even 512 KiB) would waste a lot of space since any file would use up disk space in a multiple of that number.

Luckily, Linux file systems can be tweaked a lot. I’m using ext4, where the -E stride,stripe-width parameters control the alignment. The HowTos/Disk Optimization page in the CentOS wiki gives this advice:

The drive calculation works like this: You divide the chunk size by the block size for one spindle/drive only. This gives you your stride size. Then you take the stride size, and multiply it by the number of data-bearing disks in the RAID array. This gives you the stripe width to use when formatting the volume. This can be a little complex, so some examples are listed below.

For example if you have 4 drives in RAID5 and it is using 64K chunks and given a 4K file system block size. The stride size is calculated for the one disk by (chunk size / block size), (64K/4K) which gives 16K. While the stripe width for RAID5 is 1 disk less, so we have 3 data-bearing disks out of the 4 in this RAID5 group, which gives us (number of data-bearing drives * stride size), (3*16K) gives you a stripe width of 48K.

The Linux Kernel RAID wiki offers further insight:

Calculation

  • chunk size = 128kB (set by mdadm cmd, see chunk size advise above)
  • block size = 4kB (recommended for large files, and most of time)
  • stride = chunk / block = 128kB / 4k = 32kB
  • stripe-width = stride * ( (n disks in raid5) – 1 ) = 32kB * ( (3) – 1 ) = 32kB * 2 = 64kB

If the chunk-size is 128 kB, it means, that 128 kB of consecutive data will reside on one disk. If we want to build an ext2 filesystem with 4 kB block-size, we realize that there will be 32 filesystem blocks in one array chunk.

stripe-width=64 is calculated by multiplying the stride=32 value with the number of data disks in the array.

A raid5 with n disks has n-1 data disks, one being reserved for parity. (Note: the mke2fs man page incorrectly states n+1; this is a known bug in the man-page docs that is now fixed.) A raid10 (1+0) with n disks is actually a raid 0 of n/2 raid1 subarrays with 2 disks each.
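Both quoted calculations boil down to the same two divisions and multiplications, with the results counted in file system blocks. Here is a sketch of that arithmetic; ext_raid_opts is a hypothetical helper that only prints the option string and does not touch any device:

```shell
# Hypothetical helper: derive the ext2/3/4 -E options from the RAID layout.
# Arguments: chunk size in KiB, file system block size in KiB,
#            number of data-bearing disks.
ext_raid_opts() {
    ero_stride=$(( $1 / $2 ))                 # file system blocks per chunk
    ero_stripe_width=$(( ero_stride * $3 ))   # file system blocks per full stripe
    echo "-E stride=${ero_stride},stripe-width=${ero_stripe_width}"
}

ext_raid_opts 64 4 3     # CentOS example (4-disk RAID5): -E stride=16,stripe-width=48
ext_raid_opts 128 4 2    # kernel wiki example (3-disk RAID5): -E stride=32,stripe-width=64
```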

So these are the stride and stripe-width parameters I’d use:

  • Intel SSDs with an erase block size of 128 KiB (or 512 KiB; Intel isn’t quite straightforward with this, see the comments section for a discussion on the subject. If anyone from Intel is reading this, help us out! ;-) ) that are not part of a software RAID:
    -E stride=32,stripe-width=32

  • OCZ Vertex SSDs with an erase block size of 512 KiB that are not part of a software RAID:
    -E stride=128,stripe-width=128

  • Normal hard drives that are not part of a software RAID
    trust the defaults

  • Any software RAID:
    -E stride=raid chunk size / file system block size,stripe-width=stride x number of data-bearing disks

Thus, I set up the file systems on the Intel SSD like this:

mkfs.ext4 -b 1024 -E stride=128,stripe-width=128 -O ^has_journal /dev/sda1
mkfs.ext4 -b 4096 -E stride=32,stripe-width=32 /dev/sda3

mkfs.ext4 defaulted to 1024 byte allocation units on my boot partition, so I raised the stride to 128 blocks (128 KiB / 1 KiB) according to the advice from the CentOS wiki. The alignment of my boot partition is probably not of any relevance because the system will read maybe 10 files from it and not modify anything, but I wanted to stay consistent :)

39 comments

1 ping

  1. adbge says:

    Beautifully written article — very helpful.

  2. trx says:

    great text!

    [quote]
    The OCZ post also recommends starting at the second 512-cylinder unit because the first partition is otherwise shifted by one track. Don’t ask me why :)
    [/quote]

    Because the first sector (512bytes) is Master Boot Record and cannot be part of the first partition.

  3. trx says:

    btw, can you please copy-paste output of print command in your fdisk after starting it with:
    fdisk /dev/sda -u -c

  4. cygon says:

    Ah, so that’s the reason. Thanks!

    I don’t know what -c does (and neither does my fdisk ;-)), but here’s fdisk -u -l /dev/sda for you:

       Device Boot      Start         End      Blocks   Id  System
    /dev/sda1   *        1024      132095       65536   83  Linux
    /dev/sda2          132096    16909311     8388608   82  Linux swap / Solaris
    /dev/sda3        16909312   151127039    67108864   83  Linux
    
  5. trx says:

    well, @fdisk
    [quote]Usage:
    fdisk [options] change partition table
    fdisk [options] -l list partition table(s)
    fdisk -s give partition size(s) in blocks

    Options:
    -b sector size (512, 1024, 2048 or 4096)
    -c switch off DOS-compatible mode
    -h print help
    -u give sizes in sectors instead of cylinders
    -v print version
    -C specify the number of cylinders
    -H specify the number of heads
    -S specify the number of sectors per track
    [/quote]

    and, AFAIK, Intel based SSDs use 128 blocks of 4KB as erase block size, so it should be 512kB, just as OCZ:
    http://www.anandtech.com/show/2614/3
    http://www.xbitlabs.com/articles/storage/display/intel-x25m-ssd_2.html

    please, correct me if I’m wrong.

  6. cygon says:

    My fdisk doesn’t support ‘-c’ – a quick google for some man pages also doesn’t reveal any such parameter (eg. [url]http://linux.die.net/man/8/fdisk[/url]). No idea why – running busybox 1.15.3 from January 27, 2010 – I guess it must have been removed.

    I’m quite sure that the erase block size on Intel drives is 128 KB, not 512 KB. It’s one of the features used by Intel to market their SSDs as superior to others (check this: [url]http://techreport.com/articles.x/15433[/url]) – they say the smaller erase block size reduces write overhead, thereby increasing the drive’s longevity.

    I believe the first article you linked got it a bit wrong. The graph clearly says 4 KB write requires a 128 KB erase (= write amplification of 32), but then the text from the article incorrectly states that 16 KB write would require a 512 KB erase, assuming the write amplification to be a static property of the drive. The write amplification in this case would be 8. And for a 128 KB write, it would be 1.

  7. trx says:

    I’ve asked on a few forums, even on Intel’s community forum, contacted Intel support, but got no clear answer.

    This is closest so far:
    http://forums.anandtech.com/showthread.php?t=2069082

  8. Samat Jain says:

    [quote]
    Any software RAID:
    –stride=raid chunk size
    –stripe-width=raid chunk size x number of data bearing disks
    [/quote]

    This is wrong, or at least misleading from what you said earlier.

    Chunk size is typically reported in bytes (or kilobytes). For example, a 64 KiB chunk size.

    stride is the number of blocks in a chunk size. For a block size of 4 KiB, this is 64/4 = 16.

    Likewise for stripe-width, it is the number of blocks in a stripe width. This is 16*number of data disks.

    I’ve been trying to add correct information to the Linux RAID wiki: https://raid.wiki.kernel.org/index.php/RAID_setup which should be considered the authoritative source on the subject.

  9. clesch says:

    Great article, but I think I noticed an error:

    mkfs.ext4 -b 4096 -E stride=32 -E stripe-width=32 /dev/sda3

    will give you 0 blocks stride with 32 stripe width as can be observed by the output on screen after entering the command.

    The correct command I needed to enter in order to get 32 stride with 32 stripe width is:

    mkfs.ext4 -b 4096 -E stride=32,stripe-width=32 /dev/sda3

  10. cygon says:

    @trx Thanks! I’ve dropped Intel an email, too. Hopefully one of us will get a straight answer eventually :)

    @Samat Jain: Wow, I didn’t even know the kernel team had a wiki for the raid modules. Thanks for the link. I fixed the formula and also quoted the relevant section from the kernel raid wiki.

    @clesch: Whoops. That’s unexpected. I changed the commands to read the way you’re using (and the kernel raid wiki does, too). Thanks for the clarification!

  11. nym says:

    So, after doing this step:
    fdisk -S 32 -H 32 /dev/sda

    how did you decide on which cylinder a partition like sda2 ends and therefore sda3 begins? I mean, does it matter? Can you choose any size you like?

  12. cygon says:

    You can choose any size you want. With 32 sectors and 32 heads, 1 fdisk unit will be 512 KiB, so any representable position is aligned.

    I chose powers of 2 on my rig just for the sake of it, eg. sda2 is from unit 130 to 16513 (starting and ending unit inclusive), which makes it 16384 units, times 512 KiB is exactly 8 GiB (twice my RAM).

    My 80 GB SSD is partitioned to a 64 MiB boot, 8 GiB swap and a 64 GiB root partition, leaving 2.47 GiB unused with no particular reason other than having nice numbers :)

  13. nym says:

    thanks man!

  14. fecus says:

    I would like to install W7 and ubuntu 10.04 in the same SSD.
    First, W7 will do its alignment in one half of the SSD. It will be aligned correctly.
    What will happen during ubuntu install? Will the second partition be aligned correctly? How can i check it?

  15. cygon says:

    I haven’t tried two OSs on an SSD yet, but once Windows 7 has set up its partitions, you can guarantee that the Ubuntu partition(s) are aligned by creating them yourself through fdisk as described in this article.

    I’m sure the Ubuntu installer will let you somehow shell out to the console so you can set up your partitions yourself and then install into the existing partitions.

    If fdisk displays a warning that partitions don’t start on a cylinder boundary, Windows has placed partitions so they start at unaligned offsets. In that case, halve the value of either -S or -H and choose units in steps of 2. If it still prints the warning, halve again and choose units in steps of 4. And so on…

  16. fecus says:

    Thanks.
    I tried it with the OSs built in logic.
    First W7 and second Ubuntu.
    Is there any way to check that Ubuntu did its job correctly?

  17. cygon says:

    Yes, do what I described before. Start fdisk with the -S and -H parameters as shown in the article, print the partition list and see if any warning is displayed about partitions not aligned to a cylinder boundary.

    If there is a warning, the partitions it is displayed for are not aligned.

  18. JJ says:

    Sorry in advance for the very long comment.

    I believe stride and stripe-width are measured in blocks, not KiB — the KiB units should divide and cancel. The stride would seem to simply be the chunk size in blocks, with the stripe-width being the total size of the RAID stripes in blocks.

    Everything I can find (that hasn’t seemingly been sourced from this initial article) suggests they’re only useful for RAID arrays, and are solely used to tell the filesystem how many parallel reads/writes it can do in what sort of chunks. You should be able to figure out how much parallelism it can have by dividing the stripe-width by the stride.

    When you tell the filesystem that it has a stride of 32 and a stripe-width of 32, it should still only end up doing one block at a time — no parallelism, exactly like a ‘normal’ filesystem. If it has 64 blocks to write, it’ll write block one, block two, block three, and so on. A RAID 0 array, in contrast, with a stride of 32 and a stripe-width of 64 will write blocks 1 and 33, then 2 and 34, then 3 and 35, and so on.

    In short, I’m pretty sure setting the stride and stripe-width to the same value ‘cancel out’ each other, and do not add any stride-sized caching (or similar) to the filesystem. Someone with more knowledge about the RAID subsystem and ext2/3/4 filesystems should be able to either confirm my suspicions or correct me if I’m wrong, though.

    The only disk geometry-related settings that should be affecting SSD performance, as far as I can tell, are the partition alignment being a multiple of the erase block size (meaning it’s best to align at either 512K or 1024K, depending on your desire for future-proofing vs. having a few extra kilobytes available), and the block size of the partitions (best to set to 4K currently; if you need higher, anything that is evenly divisible by 4K and divides evenly into your erase block size should work).

    Mind you, I’m not thinking about SSDs in RAID arrays here — if you’re doing that, I’d imagine making your chunk size equal to (or a multiple of) your erase block size would be the most effective.

  19. cygon says:

    stride and stripe-width are most certainly measured in blocks. Both of the quoted paragraphs from the CentOS wiki and the kernel raid wiki divide the stride value by the file system’s block size (though the kernel raid wiki says “32 kB” whereas it should say 32 blocks – the math is right).

    You say that everything you can find suggests that stride and stripe-width are only useful for RAID arrays. Could you provide some links?

    As far as parallelism is concerned, stride and stripe-width might cancel each other out, but the intention here is to have the file system align its data structures to certain offsets, only that instead of optimizing a RAID controller/softraid it now optimizes accesses to the SSD.

  20. JJ says:

    Actually, after more research, it turns out I’m somewhere between partially and mostly wrong about stripe-width and stride. Sorry.

    The mkfs.ext4 manpage says, for stride, “This mostly affects placement of filesystem metadata like bitmaps at mke2fs time to avoid placing them on a single disk, which can hurt performance. It may also be used by the block allocator.” Sounds like it only affects things by placing data across multiple disks for speedy access. It doesn’t look like setting it *hurts* anything, though — that “may be used by the block allocator” does sound somewhat ominous, after all.

    I was also entirely wrong about the stripe-width having no effect. Setting it to the number of blocks per erase block seems like the ‘best’ idea (or possibly just 512K — I’m pretty sure the Intel MLC drives are also 512K). The part I missed about this was how the filesystem tries its best to avoid writing data to a partially-filled ‘stripe’, avoiding the read-modify-write that can occur in both RAID arrays (RAID 5 only, I think) and SSDs.

    I think, due to TRIM and the ‘smart’ firmware in some SSDs (particularly Intel), this will all only make a significant difference when the SSD is almost entirely full, at which point the filesystem’s block allocations may become a lot more important.

    There also appears to be a mount option to do this (partially?) for an already-created filesystem — stripe=N (where N is, again, a number of blocks). I don’t know if it has any further effects, or is simply redundant when stripe-width is set at mkfs time.

    I’m sorry for any confusion caused by my horribly incorrect comment. I think my head was spinning a little too much at the time I wrote it, from all the RAID and ext4 docs/manpages I had read. I was only thinking about writing to the disk, and not thinking about rewrites or block allocation.

  21. Erik Franzén says:

    After reading your article I decided to give it a try.

    I have bought two 80 GB Intel X25-M SSDs for my home server. The plan is to install Ubuntu 10.04 64-bit server and use the SSDs as system discs and vmware data storage using software raid for redundancy.

    After reading the blog post I am not sure how to make ALL my partitions aligned and set up on EBS (Erase Boundary Size)

    I am planning for four partitions:

    Boot, size 1GB
    Root, size 25GB
    Swap, size 4GB
    Data storage for vmware server, size 40GB

    According to the article I should use 32 heads and 32 sectors.

    Using the live CD, I started using fdisk -S 32 -H 32 /dev/sda

    Fdisk can create partitions using cylinders or sectors, and now I ran into trouble.

    First partition /boot must start on cylinder 2 (or sector 1024). Size is 1 GB and the following partition should be aligned and start on a new EBS block. How do I do this with fdisk?

    Should the next partition start on a new cylinder? Otherwise, after formatting, fdisk gives a warning that the partition is not aligned to the cylinder size?

    The overall question is how to format four partitions which are all aligned with Intel’s X25-M EBS.

  22. Celox says:

    hi interesting article!
    what would be the optimal blocksize for one partition with ext4 that includes boot partition and data (mostly small data) for best performance? should i go with 1024 or 4096 :o

  23. Jeff says:

    I forgot to mention, I said “out of the box” however, I did add code to modprobe vboxnetadp. Basically dup’d some code there already.

    …just being OCD ;-)

  24. Seth Baker says:

    Thanks for the guide. I’ve been trying to get a deep understanding of this alignment stuff for a while. I have a 16 GB (SLC flash) SSD that I partitioned with fdisk using the options given here. It is used as a portable tech toolbox rather than for a specific computer, but that is another story.
    I split it into two partitions. I set the first one to start on cylinder 2 and end on cylinder 16384; it was a nice base 2 number, also a multiple of 1024 and about half of the SSD’s size. The second partition used up the rest of the space. So the first partition was an 8.5 GB NTFS partition and the second a 7.5 GB partition.
    So I checked the drive with Paragon Software’s Alignment Tool. It said that the first ntfs partition was aligned and the second linux partition was aligned. I told it to align the partition. I then booted back into linux and checked the partition with fdisk with the guide’s options. I saw that it adjusted the start of the first partition (ntfs) to start on cylinder 5 instead of 2. (2.1 MB free space before partition now) Did Paragon Alignment tool do anything useful? Or maybe it is better to start on cylinder 5

  25. smax says:

    Hey,
    [quote](64K/4K) which gives 16K[/quote]
    [quote]stride = chunk / block = 128kB / 4k = 32kB[/quote]
    Both wikis’ authors must be shirking school, don’t quote them blindly.

  26. cygon says:

    I think “quoting them blindly” is the best and only thing to do in this place.

    If a book discussed different opinions of experts, would you want it to edit any quotes from the experts so they conform with the opinion of the book’s writer? Nope, the writer should present his own opinion and give the reader a chance to follow his reasoning by quoting the source materials he draws his conclusions from.

    Apart from that, the second formula is actually well-formed (the divisor omits the ‘B’), so it’s 128,000 B / 4,000 = 32,000 B. Well, except that the result would be 32, I guess :)

    Erm…

  27. cygon says:

    I’m just setting up a new (non-SSD) hard drive and had to revisit my notes on alignment and stuff because this is probably using a sector size of 4K.

    This article was extremely helpful:
    http://randomtechoutburst.blogspot.com/2010/03/4k-alignment-for-disks-important.html

    It also mentions to start fdisk with -c (disable DOS compatibility) and that newer fdisk versions automatically align to 2048 sectors (at least for the first partition).

  28. Alexander Ofen says:

    First of all! Great article (really helpful)!

    Question: in the part of the article of the partition aligning it says: [quote]“[...o] OCZ forums with 32 heads and 32 sectors which means fdisk uses a cylinder size of 1024 bytes. And because fdisk partitions in units of 512 cylinders (= 512 x heads x sectors) fdisk’s unit size now happens to be an SSD’s maximum erase block size. Nice! ”
    [/quote]
    Am I correctly assuming that fdisk does not have “units of 512 cylinders” but rather uses 512 bytes per sector, which leads to the result?
    After all you correctly explained that according to the given “sectors and heads” layout the number of cylinders is of course calculated by the capacity/size of the drive, right?

  29. Vasily Anonimov says:

    224 heads and 56 sectors:

    http://www.linuxfoundation.org/news-media/blogs/browse/2009/02/aligning-filesystems-ssd%E2%80%99s-erase-block-size

    >However, with SSD’s (remember SSD’s? This is a blog post about SSD’s…) you need to align partitions on at least 128k boundaries for maximum efficiency. The best way to do this that I’ve found is to use 224 (32*7) heads and 56 (8*7) sectors/track. This results in 12544 (or 256*49) sectors/cylinder, so that each cylinder is 49*128k. You can do this by doing starting fdisk with the following options when first partitioning the SSD:

  30. Sean says:

    I’ve read countless articles on this topic and this is by far the most elegant one I’ve come across. Unfortunately I don’t have the ability to reformat my SSD from scratch at the moment, so I can’t take advantage of this yet, but I just wanted to thank you for writing this in a way that’s fairly easy to understand. Bookmarked for future reference!

  31. Jamie Kitson says:

    Three letters: GPT.

  32. Gordan says:

    I suspect you’ll find that setting the -E stride= parameter on SSDs is at worst counter-productive and at best pointless. This is used to make sure that metadata is spread evenly across disks. In the case of a single SSD this will in some cases lead to metadata not being in the same erase block as the data it is referring to, thus requiring two erase blocks to be written to. Stick to using just the -E stripe-width= option on SSDs.

  33. cygon says:

    Any links on that?

    I’m no file system guru, but I thought metadata is usually not mixed with actual file data, but kept in a central place (the inode table in ext*fs) – except for very small files which are stored inside the inodes themselves to reduce seek times.

    Thus, either the metadata entry should contain the file data in the same block or they’re in two distant locations anyway, no matter the alignment.

  34. Tobias Brox says:

    Lots of tuning here … but no testing? If the tuning cannot be tested or doesn’t cause any measurable effects then it seems a bit wasteful.

  35. cygon says:

    I’m only explaining the reasoning and meaning of the various options that can be tweaked – with the goal to avoid the worst case scenario of requiring two SSD block erasures for any write done by the file system.

    Actual performance tuning is left as an exercise for the reader! :D

  36. cameel says:

    Why would the partition alignment have any effect on the operation of the SSD? Doesn’t the controller have some internal LBA -> page address mapping (to be able to freely rearrange the data and use wear leveling) that makes the whole point moot?

  37. Cygon says:

    Because the controller/firmware also works in blocks.

    If it was able to freely rearrange data, the worst case scenario would be somewhere around having lots of byte-sized chunks at random places in the SSD’s memory. To store the location of each byte-sized chunk in a table, that table would have to be around 5 times as large as the SSD (assuming a 40 bit number for the position of each byte). Or to just record where each byte went in a data structure like a B-tree, traversing that data structure would take forever.

    To remain fast and yield consistent performance, an SSD’s firmware only moves data around at the erase block level. That way, an 80 GiB SSD has only 163840 blocks and the remapping table can be stored in a fixed-size 24 bit address table (taking up half a megabyte). Maybe double that amount for the SSD’s wear leveling to store the number of writes that have happened to each block.

  38. Vahid Pazirandeh says:

    Great article, thank you. :)

    And to understand CHS addressing a bit better I went to: http://en.wikipedia.org/wiki/Cylinder-head-sector

  39. Vahid Pazirandeh says:

    Might I add that I highly recommend using the Linux utility “parted” instead of “fdisk”. Parted is just way more flexible, and it made it much easier to see exactly which sector or byte offset I was starting my partitions at, instead of compensating for fdisk’s CHS notation. Once you go parted you don’t go back. Believe me. :-)

    Here is a quick example of partitioning with parted, assuming you’ve typed “parted /dev/sda” on a disk with no partitions, and with an erasure block size of 2MiB. Someone please correct me if I’m wrong:

    unit B
    mkpart primary 2MiB 100%
    p

    My version of fdisk (2.20.1) no longer defaults to units of cylinders, like that which is shown in this blog post, but uses units of sectors instead (I have to type “unit” to switch to cylinder view). Also, when I “fdisk -H 32 -S 32 /dev/sda” and make a partition, I can’t choose a starting sector smaller than 2048 (which is exactly the beginning of the third cylinder, given S=32, H=32: 32 * 32 * 2 = 2048). So fdisk still partitions on cylinder boundaries, but it’s not very flexible in choosing exactly where you want to start. Parted just lets you specify start points in terms of bytes, which is what we’re really after anyway.
