
Tuesday Tech Tip - Accelerating ZFS Workloads with NVMe Storage

  9,580 views

45Drives · 1 day ago

Comments: 46
@---tr9qg · 1 year ago
🔥🔥🔥
@midnightwatchman1 · 1 year ago
Actually, document management servers frequently have over 100K files in one directory. Massive workloads do exist; a human may not create them directly, but applications frequently do.
@TheChadXperience909 · 1 year ago
I'm sure it would benefit email servers, as well.
@n8c · 1 year ago
Temperature tracking software (food transport industry) does this as well. Devs do crap like this all the time, where stuff clearly belongs in a DB or sth 😅
@HoshPak · 18 days ago
Would having a SLOG and a cache device still make sense when there is already a special vdev in the pool? I'm building a compact storage server that fits 64 GB of RAM, 2 NVMe drives and 4 HDDs. I could imagine partitioning the NVMe drives equally so I have everything mirrored plus a striped cache. Would that be useful? What is a good way to measure this?
@89tsupra · 1 year ago
Thank you for the explanation. You mentioned that the metadata is stored on the disks and that adding an NVMe will help speed that up. Would you recommend adding one for an all-flash storage pool?
@steveo6023 · 1 year ago
As he said, it keeps that load off the storage disks (or flash). Depending on the workload, it could also improve performance for an all-flash pool.
@TheChadXperience909 · 1 year ago
M.2 NVMe drives have lower latencies than drives connected via SATA, and often have faster read/write throughput. It would accelerate such an array, but to a lesser extent. When comparing, you should look at their IOPS.
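A quick way to make that IOPS comparison is a short 4K random-read test with fio; a rough sketch, assuming fio is installed and the device paths (hypothetical here) point at disks you can safely benchmark:

  # Compare 4K random-read IOPS of a SATA SSD vs. an NVMe drive (read-only).
  for dev in /dev/sda /dev/nvme0n1; do
    fio --name="randread-$(basename "$dev")" --filename="$dev" \
        --rw=randread --bs=4k --iodepth=32 --numjobs=1 --direct=1 \
        --runtime=30 --time_based --ioengine=libaio --group_reporting
  done

The 'iops' figure fio reports for each device is the number to compare, not MB/s.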
@shittubes · 1 year ago
It can create fairer sharing between multiple applications with different access patterns, so that a high-throughput sequential write load won't affect another workload that does mostly very small I/O, either just on metadata or using small block sizes (handled by the special device).
@ati4280 · 10 months ago
It depends on the SSD types. If you add an NVMe drive to an all-flash SATA pool, the benefit will not be as noticeable as when accelerating an HDD-only pool; the IOPS difference between NVMe and SATA drives is not that significant. And the 4K performance of an SSD is not only related to its interface: the NVMe controller model, NAND type, and cache speed and size also play a big role in the drive's final performance.
@n8c · 1 year ago
Do you usually run some performance metrics on your customers' machines once they have been built out? It feels like you could easily let the same tools run in the background to generate some exemplary "load at 10 am might look like this" figures, which should easily show the differences. For StarWind vSAN I used DiskSPD, which seems to have a Linux-port Git repo (YT doesn't like links; it's the first result on Google).
@chrisparkin4989 · 1 year ago
Great vid, but won't 'hot' metadata live in your ARC (RAM) anyway, and isn't that surely the fastest place to have it?
@TheExard3k · 1 year ago
It would. But ARC evicts stuff all the time, so your desired metadata may not stay there. Tuning parameters can help with this. But having metadata on SSD/NVMe guarantees fast access. And the vdev increases pool capacity, so it's not "wasted" space. Worth considering if you have spare capacity for 2xSSD/NVMe. And you really need it on very large pools or when handing out a lot of zvols (block storage).
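One way to check how well the ARC is already covering your metadata before committing hardware is to read the ARC kstats; a minimal sketch, assuming OpenZFS on Linux, with a hypothetical pool/dataset name:

  # Rough demand-metadata hit rate from the ARC statistics.
  awk '/^demand_metadata_hits/ {h=$3} /^demand_metadata_misses/ {m=$3}
       END {printf "metadata hit rate: %.1f%%\n", 100*h/(h+m)}' \
      /proc/spl/kstat/zfs/arcstats

  # Optionally bias a dataset's ARC usage toward caching metadata only.
  zfs set primarycache=metadata tank/media

A consistently high hit rate means a special vdev mostly helps on cold lookups; a low one means it will be earning its keep.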
@zparihar · 7 months ago
Great demo. What are the risks? Do we need to mirror the special device? What happens if it dies?
@meaga · 5 months ago
You would lose the data in your pool. So yes, I'd recommend mirroring your metadata device. Also make sure that it is sized appropriately for the size of your data vdevs.
@pivot3india · 1 year ago
Is it good to have a metadata disk even if we use ZFS primarily as a virtualisation target?
@Solkre82 · 4 months ago
If you add a metadata vdev to a pool, is it safe to remove later? Is this a cache, or does metadata stop going to the data disks entirely?
@45Drives · 4 months ago
The metadata vdev houses the data about the data, so things like properties, indexes, etc.; essentially pointers to where the data is in the pool/dataset. If you remove that, it's like the data has nothing to tie it to specific blocks in the pool, rendering all data inaccessible. So no, not safe to remove.
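For reference, a special vdev is added as its own vdev class under the pool; a minimal sketch of adding a mirrored pair, with hypothetical pool and device names:

  # Add a mirrored special (metadata) vdev to an existing pool.
  zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

  # It should show up under its own "special" section.
  zpool status tank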
@StephenCunningham1 · 1 year ago
Stinks that you lose the whole pool if the mirror dies. I'd want to RAIDZ2 the special vdev as well.
@TheChadXperience909 · 1 year ago
In my experience, it really accelerates file transfers, especially when doing large backups of entire drives and file systems.
@steveo6023 · 1 year ago
How can this improve transfer speed when only metadata is on the NVMe?
@TheChadXperience909 · 1 year ago
@steveo6023 It speeds things up because flash storage is faster at small random I/O than HDDs. Even though they are small reads/writes, they add up over time. It also prevents the read/write head inside the HDD from thrashing around as much, which reduces seek latency and can also benefit drive longevity.
@steveo6023 · 1 year ago
@TheChadXperience909 But metadata is cached in the ARC anyway.
@TheChadXperience909 · 1 year ago
@steveo6023 That applies only to reads, and it always depends.
@shittubes · 1 year ago
@steveo6023 If spinning drives can spend 99% of their time on sequential writes, they will be very fast. If, for example, 50% of the time is spent on random writes for metadata, the transfer speed will be halved. If the NVMe metadata handling doesn't add other unexpected delays (which I don't know; I'm only wondering whether it does), this could be completely predictable in this linear way.
@teagancollyer · 1 year ago
Hi, what capacity HDDs and NVMe were used for the video? I'm terrible at reading Linux's storage capacity counters. I'm trying to work out a good NVMe capacity for my 32TB (raw) pool; is 500GB a good amount?
@45Drives · 1 year ago
We used 16TB HDDs and a 1.8TB NVMe. How much metadata is stored will vary depending on how many files are in the pool, not only how big it is: 32TB of tiny files will use more metadata space than 32TB of larger files. So it's not always straightforward to pick the size of the special vdev needed.

Okay, so where to go from here? The rule of thumb seems to be about 0.3% of the pool size for a typical workload. This is from Wendell at Level1Techs, a very trusted ZFS guru. See this as a reference: forum.level1techs.com/t/zfs-metadata-special-device-z/159954

So, in your case, 0.3% of 32TB would be 96GB. Therefore, a 512GB NVMe will work. Remember, you will want to at least 2x mirror this drive and buy enterprise NVMe, as you will want power loss protection.

If you already have data on the pool, you can get the total amount of metadata currently being used with a tool called 'zdb'. Check out this thread as a reference: old.reddit.com/r/zfs/comments/sbxnkw/how_does_one_query_the_metadata_size_of_an/

You can follow the steps in that thread, or use a script we put together inspired by it: scripts.45drives.com/get_zpool_metadata.sh — usage: "bash get_zpool_metadata.sh poolname"

Thanks for the question, hope this helps!
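If it helps, the 0.3% rule of thumb is easy to compute for any pool from zpool's parsable output; a quick sketch, with 'tank' as a hypothetical pool name and 0.3% being only the guideline quoted above:

  # Rule-of-thumb special vdev capacity: ~0.3% of the pool's raw size.
  zpool list -Hp -o size tank \
    | awk '{printf "suggested special vdev capacity: ~%.0f GiB\n", $1*0.003/2^30}'

Each NVMe in the mirror would need at least that much usable space.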
@teagancollyer · 1 year ago
@45Drives Thanks for the reply. All of that info will be very useful and I'll be reading those threads you linked in a minute.
@cyberpunk9487 · 1 year ago
I'm curious, does this benefit iSCSI LUNs and VM disks? Say I want to use TrueNAS as an iSCSI storage target for Windows VMs, and I would also like to use an SR (storage repo) for VM disks to live on.
@shittubes · 1 year ago
It's only useful for datasets, not usable for zvols.
@TheChadXperience909 · 1 year ago
Metadata doesn't (only) mean file metadata in this case. Zvols also consist of metadata nodes and data nodes, and the metadata nodes do get stored on the special vdev as well. However, you'll likely see acceleration to a lesser degree than with regular datasets. Though I read somewhere that you may be able to use file-based extents for iSCSI, which means dataset rules would apply.
@cyberpunk9487 · 1 year ago
@TheChadXperience909 From what I remember, you can use file extents for iSCSI on TrueNAS, but I vaguely recall hearing that some of the iSCSI benefits are lost when not using zvols.
@TheChadXperience909 · 1 year ago
@cyberpunk9487 Makes sense.
@mitcHELLOworld · 1 year ago
@shittubes We actually don't use zvols for iSCSI LUNs, for a few reasons. We have found much better success deploying fileIO-based LUNs that we create within a ZFS dataset. I believe one of my videos here goes over this, but perhaps it's time for a good refresher on ZFS iSCSI.
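For anyone wondering roughly what a fileIO-backed LUN looks like on Linux, here is a hedged sketch using targetcli (LIO); the dataset, file path, size and IQN are hypothetical, the LUN/ACL/portal steps are omitted, and this is not necessarily how 45Drives configures it:

  # Dataset to hold the backing files for iSCSI extents.
  zfs create tank/iscsi

  # Sparse backing file for the LUN.
  truncate -s 1T /tank/iscsi/vmstore.img

  # Register it as a fileio backstore and create a target (targetcli-fb).
  targetcli /backstores/fileio create vmstore /tank/iscsi/vmstore.img
  targetcli /iscsi create iqn.2024-01.com.example:vmstore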
@shittubes · 1 year ago
I'm honestly quite disappointed that the speedup from the NVMe special device is quite a lot smaller in the larger folders:
500k: 18/11 ≈ 1.64x
1k: 119/21 ≈ 5.67x
The first examples were nice, a 6x speedup, why not. But a 2x speedup is not so impressive any more, considering that NVMe should normally be 10x faster even at the largest block sizes. In the iostat output I also see the NVMe often being read at just 5MB/s; why is it so low?!
@TheChadXperience909 · 1 year ago
The law of diminishing returns.
@mitcHELLOworld · 1 year ago
5MB/s isn't what matters here. The rated IOPS of a drive is what tells you how fast the storage media will run. For example, if your storage I/O pipeline is using a 1KB block size (it isn't, but just as an example), then your storage media needs to be able to do 5,000 IOPS to even hit 5MB/s, whereas if your block size was 1MB, 5,000 IOPS would be 5GB/s. An HDD is capable of somewhere in the neighborhood of 400 IOPS total (and that's being generous), so an HDD couldn't even hit 450KB/s if you were to use a 1KB block size.

As for the special vdev: it's what we consider a "support vdev" and is best used in conjunction with the ARC. It isn't meant to serve ALL metadata requests. However, to easily show the difference between no NVMe and NVMe for the special vdev, he had to considerably handicap the ZFS ARC, because during this test there are no real-world workloads happening, and if he had kept the ARC fully sized you wouldn't have seen a difference between the two anyway, because the ARC would have held everything.

In a production setting, a large subset of the pool's metadata will be stored in RAM in the ARC, and the special vdev will be there for any metadata lookups that miss the ARC. When ZFS has a cache miss and there isn't a dedicated special vdev, this can cause quite a bit of latency and slowdown. By adding the special vdev, those metadata lookups are accelerated by a huge factor.

The special vdev can also be put to use for small block allocation, which is really cool and can really improve performance of the overall pool. Perhaps we will cover this in more detail in a follow-up video. In the meantime, Brett and I did discuss this in our "ZFS Architecture and Build" video from a few months ago!
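The block-size arithmetic above is easy to sanity-check, and the small-block allocation mentioned at the end is exposed as a dataset property; a minimal sketch, where the 64K cutoff and dataset name are hypothetical examples rather than a recommendation:

  # Throughput = IOPS x block size: 5,000 IOPS at 1 KiB vs. 1 MiB blocks.
  awk 'BEGIN {
    iops = 5000
    printf "1 KiB blocks: %.1f MiB/s\n", iops * 1024 / 2^20
    printf "1 MiB blocks: %.1f GiB/s\n", iops * 2^20 / 2^30
  }'

  # Route blocks at or below 64K to the special vdev in addition to metadata.
  zfs set special_small_blocks=64K tank/vmstore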
@shittubes · 1 year ago
@mitcHELLOworld Do I understand correctly that you didn't start with an empty ARC? I can confirm that I often see something around 60-95% ARC hits for metadata in production here, even with a small ARC. This would indeed seem a good enough explanation for why the ratios between HDD and NVMe times aren't higher.
@shittubes · 1 year ago
@mitcHELLOworld What was your recordsize? I'm not sure what block size is best for just metadata in such an edge case. It would be funny to dig deeper here and check the size of the actual read()s returning. I agree it's better to look at IOPS, old habits :D So, revisiting the 1000k scenario and concentrating just on the IOPS: the special device seems to peak at around 19K IOPS near the beginning but averages ~7K, while the HDDs (all together) don't do much worse, averaging 5-7K IOPS. I feel like something else must be bottlenecking, not the actual IOPS capacity of the NVMe drives. Or do you consider 7K good? :P
@steveo6023 · 1 year ago
Unfortunately it adds a single point of failure when using only one NVMe device, as all data will be gone when the metadata SSD dies.
@TheChadXperience909 · 1 year ago
That's why you should always add it as a mirror, which also has the effect of nearly doubling read speeds, since it can read from both sides of the mirror. The presenter is using mirrors in his example.
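If a pool was originally built with a single special device, it can usually be converted into a mirror after the fact by attaching a second device; a hedged sketch with hypothetical pool and device names:

  # Attach a second NVMe to the existing special device to form a mirror.
  zpool attach tank /dev/nvme0n1 /dev/nvme1n1

  # Watch the resilver finish before relying on the new redundancy.
  zpool status -v tank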
@n8c · 1 year ago
This is a lab env, as stated. You wouldn't run this exact setup in prod, for various reasons 😅