Disk Safe Best Practices

The Server Backup Manager Disk Safe is highly reliable and robust, built with industrial-grade protection from crashes and power loss. Its automatic mechanisms, which protect terabytes of archived data from crashes and power failures, are not found in any other disk-to-disk backup software.

This is accomplished using an atomic write journal. Before any changes are made to a Disk Safe file, a new on-disk journal file is created. Before any Disk Safe pages are altered, their original content is written to the journal file. This allows any transaction to be completely rolled back, even if it is interrupted by a crash or power failure. The Disk Safe assumes that the operating system buffers writes, so a write request may return before the data has actually been written to disk. It also assumes that the operating system may reorder write operations. For this reason, the Disk Safe performs fsync() operations on Linux at key points.
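The rollback sequence described above can be sketched in shell. This is an illustration only, not the SBM implementation: the file names (safe.db, safe.db-journal) are hypothetical, a whole file stands in for a Disk Safe page, and the system-wide sync command stands in for a per-file fsync().

```shell
#!/bin/sh
# Toy rollback journal. Assumptions: file names are hypothetical, one file
# stands in for a Disk Safe page, and `sync` stands in for fsync().
workdir=$(mktemp -d)
data="$workdir/safe.db"
journal="$workdir/safe.db-journal"

printf 'original page\n' > "$data"

# 1. Save the original content to the journal and flush it to disk
#    before touching the data file.
cp "$data" "$journal"
sync

# 2. Begin altering the data file; pretend a crash interrupts the write here.
printf 'half-written pa' > "$data"

# 3. Recovery at next start: a leftover journal means the transaction never
#    committed, so restore the original content and then drop the journal.
if [ -f "$journal" ]; then
    cp "$journal" "$data"
    sync
    rm -f "$journal"
fi

cat "$data"   # prints "original page"
```

The ordering is the point of the sketch: the journal must be durable before the data file changes, and the journal is removed only after the restored data is durable.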

Always use stable and reliable storage

Tips for selecting stable and reliable storage:

Only use NAS appliances that support NFS locking; this includes most commercial-grade NAS appliances, but not all.
If you are using hardware RAID, it is recommended to use a battery backup unit (BBU) for the RAID card or RAID system.
If you do not have a BBU for your RAID system, it is recommended to disable the write back cache. Consult the documentation for your RAID system.
IDE drives are not recommended. It has been reported that many IDE drives ignore signals to flush their on-disk cache.
If you are using network-attached block storage such as iSCSI or ATA-over-Ethernet, it is highly recommended that you thoroughly test your storage environment for its ability to handle failures, in particular network-level failures. Our customers have reported that these types of storage systems are the most prone to hardware and network faults.
Avoid budget RAID cards. Entry-level RAID cards have frequently been reported to have serious performance and reliability issues. At the bottom end are budget RAID cards that do not perform hardware RAID at all and are simply a SATA card with a software RAID driver.
USB and other portable media are great for temporary transport of a Disk Safe from one location to another. USB and other portable media should not be used for regular storage of Disk Safes. These types of inexpensive storage have been reported to ignore signals to flush their on-disk cache. In addition, their portable nature can cause disconnects or power failures simply by nudging a cable.
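As a starting point for reviewing the caching tips above, the following sketch lists the write cache mode the Linux kernel reports for each block device. It assumes a kernel that exposes /sys/block/<dev>/queue/write_cache; the function name and its optional argument are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# List each block device's write cache mode as the Linux kernel reports it.
# Assumption: the kernel exposes /sys/block/<dev>/queue/write_cache;
# the optional argument exists only to make the function easy to test.
report_write_cache() {
    sysroot=${1:-/sys/block}
    for f in "$sysroot"/*/queue/write_cache; do
        [ -r "$f" ] || continue          # skip devices without the attribute
        dev=${f#"$sysroot"/}             # strip the leading directory
        dev=${dev%%/*}                   # keep only the device name
        printf '%s: %s\n' "$dev" "$(cat "$f")"
    done
}

report_write_cache
```

A value of "write back" means a volatile cache is in use. This shows only the kernel's view; it cannot prove that a drive or controller actually honors flush requests, so failure testing is still needed.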

Only use USB drives to transport Disk Safes

Use USB drives to transport Disk Safes. For example, a USB drive is excellent for moving a Disk Safe from your office to your data center, or between two Server Backup Managers connected by a slow network connection. When using a USB drive, follow the steps to safely remove the device from your computer.

R1Soft does not recommend using a USB drive as the everyday, primary storage for Disk Safes. USB drives tend to be unreliable, and it has been reported that many USB drives ignore FlushFileBuffers() and fsync() requests to flush the disk cache.

Storage best practices

We recommend the use of XFS or Ext4 for Disk Safe storage volumes.
XFS provides a preallocation mount option (allocsize=1g) to minimize fragmentation, and it supports online defragmentation.
If using Ext4, make sure you are using Ext4 extents. Extents help reduce file system fragmentation.
We recommend RAID 1, 10, or 1E, whether hardware or software RAID. Parity RAID (5/6) is not well suited to the high rates of random reads and writes that a Disk Safe requires.
Avoid LVM for best performance.
Do not use Ext3.
- Storing Disk Safes on Ext3 can result in disk corruption. Add the mount option barrier=1 to /etc/fstab on Ext3 file systems where Disk Safes are stored, OR disable your storage controller's write cache. Ext3 does not checksum the journal. If barrier=1 is not enabled as a mount option (in /etc/fstab) and the hardware is doing out-of-order write caching, you run the risk of severe file system corruption during a crash. For more information, click here.
- Storing Disk Safes on Ext3 can cause performance problems. The Disk Safe periodically (sometimes frequently) uses fsync() to force changes to a file through to disk hardware, ensuring data is protected from a power loss or crash. Ext3 has a well-known broken implementation of fsync() that causes a flush of one Disk Safe file to sync the entire file system to disk. This can impose a severe performance penalty on Disk Safe I/O, and writes to one Disk Safe can degrade reads of other Disk Safes.
- If you store your Disk Safes on Ext3, it is recommended that only Disk Safes are stored on that file system, due to Ext3's problematic fsync() implementation. Do not store applications and daemons on an Ext3 file system with your Disk Safes if you can avoid it.
When using Ext3 or Ext4, make sure you accept the default block size (4 KB). Forcing smaller block sizes will decrease performance for the Disk Safe's large files and can limit max file size to as little as 16 GB, which will likely cause 'File too Large' errors during initial replicas.
You may not be able to get desired performance from network attached storage based on commodity devices/servers.
- If using NFS, it is recommended to use the latest stable NFS versions and latest stable Linux kernels.
- Never export NFS file systems with asynchronous writes enabled (async option). Exporting NFS with the async option can cause data corruption in the event of a failure.
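The NFS export rule above can be checked with a short script. This is a minimal sketch that assumes exports are defined in the conventional /etc/exports file; the function name is hypothetical.

```shell
#!/bin/sh
# Print any export lines that enable the async option.
# Assumption: exports live in /etc/exports; the optional argument exists
# only to make the function easy to test.
find_async_exports() {
    exports_file=${1:-/etc/exports}
    [ -r "$exports_file" ] || return 1
    # Drop comment lines, then match "async" as a whole word so that
    # the safe "sync" option is not reported.
    grep -v '^[[:space:]]*#' "$exports_file" | grep -w 'async'
}

if find_async_exports; then
    echo "WARNING: async exports found; switch them to sync" >&2
else
    echo "no async exports found"
fi
```

The check reads only the exports file on the NFS server, so it does not cover exports configured elsewhere.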

Defragment file systems

The Server Backup Manager Disk Safe is a storage system for long-term archiving of unique block-level (units of data below the file system) deltas (small differences in data). SBM extends the Disk Safe files and makes writes in predictable 32 KB increments. This helps modern file systems pre-allocate space for the Disk Safe files and naturally reduces file fragmentation. As the Disk Safe block stores (.db files) for each device you are protecting age, some data inside those files remains forever, and some is marked deleted and recycled for new deltas as old data is merged out over time. If there is no recycled free space inside the file for new deltas, the file is extended on disk. This can happen any time a new recovery point is created.

Generally, any files that remain on disk for a long period of time and continue to be written to and extended in size are subject to file system fragmentation. This is true for the R1Soft Disk Safe and for other long-term storage mechanisms (for example, a relational database such as MS SQL Server or MySQL).

For optimal performance, R1Soft recommends that you periodically defrag your file systems where Disk Safes are stored. If it is feasible in your environment, a weekly file system defrag is optimal.
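To decide whether a defrag is worthwhile, it can help to measure how fragmented the Disk Safe files are. The sketch below assumes the filefrag utility (part of e2fsprogs) is installed and that its summary line has the form "path: N extents found"; the function names and the example path are hypothetical.

```shell
#!/bin/sh
# Pull the extent count out of a filefrag summary line, for example
# "/disk/safes/example.db: 12 extents found" (the path is hypothetical).
extent_count() {
    sed -n 's/.*: \([0-9][0-9]*\) extents\{0,1\} found.*/\1/p'
}

# Report the extent count of every .db file under a directory, most
# fragmented first. Assumption: filefrag is installed; it may need root.
report_fragmentation() {
    dir=$1
    command -v filefrag >/dev/null 2>&1 || {
        echo "filefrag not installed" >&2
        return 1
    }
    find "$dir" -type f -name '*.db' | while IFS= read -r f; do
        printf '%s %s\n' "$(filefrag "$f" | extent_count)" "$f"
    done | sort -rn
}

# Example: report_fragmentation /disk/safes
```

A file with a handful of extents is effectively contiguous; a block store with many thousands of extents is a reasonable candidate for defragmentation.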

XFS file system:

How to use XFS defrag.

Tutorial for XFS on CentOS.

Ext4 file system:

Ext4 has an experimental online defrag capability. Eventually, we all expect this to become stable for production servers.

Ext4 has extents which help reduce file fragmentation. Extents do not eliminate file system fragmentation and Ext4 can still benefit from defrag.

Be aware that the consensus is that Ext4 online defrag is NOT ready for production systems.

Click here for a thread in Ubuntu bugs about the topic of Ext4 online defrag (look for I_KNOW_E4DEFRAG_MAY_DESTROY_MY_DATA_AND_WILL_DO_BACKUPS_FIRST).

Disk Safe Compaction:

Disk Safe compaction can be performed on individual Disk Safes from the SBM UI. This defragments a Disk Safe that is performing poorly and reclaims free pages.

Offline Defragmentation:

It is possible to perform an offline defrag of any Linux file system. Due to the work involved, it is recommended to perform this task only once or twice a year.

Shut down your Backup Manager:

/etc/init.d/cdp-server stop
Make an intermediate disk available that is capable of holding all of your Disk Safes (this could be network storage).
Copy all of your Disk Safes to the intermediate storage:

cp -af /disk/safes/* /mnt/storage/
Re-format the file system that has the primary copy of your Disk Safe:

mkfs /dev/YOUR_DEVICE
Copy all of the Disk Safe files back:

cp -af /mnt/storage/* /disk/safes/

Start the Backup Manager:

 
/etc/init.d/cdp-server start
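Because the procedure above reformats the primary file system, it is worth confirming that the intermediate copy is complete before running mkfs. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the destination is assumed to be empty beforehand, and the destructive steps are left as comments.

```shell
#!/bin/sh
# Copy a Disk Safe directory, then confirm the copy matches the source.
# Returns non-zero if the copy fails or any file differs or is missing.
# Assumption: the destination directory is empty before the copy.
copy_and_verify() {
    src=$1
    dst=$2
    cp -a "$src"/. "$dst"/ || return 1   # the "/." form also copies hidden files
    diff -r "$src" "$dst" >/dev/null
}

# Outline of the procedure; paths and device come from the steps above.
# Destructive steps are left commented out on purpose.
#   /etc/init.d/cdp-server stop
#   copy_and_verify /disk/safes /mnt/storage || exit 1
#   mkfs /dev/YOUR_DEVICE      # file system must be unmounted, then remounted
#   copy_and_verify /mnt/storage /disk/safes || exit 1
#   /etc/init.d/cdp-server start
```

Comparing with diff -r reads every file twice, which is slow for large Disk Safes, but it prevents reformatting the only good copy.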
 

Note on Linux Defrag
There appears to be a widespread fallacy that somehow Linux file systems (ext2, ext3, and ext4) are magically immune to fragmentation. This could not be further from the truth. It's true that the kernel does a "pre-allocation" as you are writing to a file. It will attempt to notice that an application keeps extending a file and allocate contiguous blocks ahead of what it is doing. No matter how fantastical pre-allocation or the file system may be, files get fragmented and you need a way to pack them. Linux file systems are no exception.

When to vacuum Disk Safe

For best performance, vacuum your Disk Safes when you need to reclaim unused storage space. The Vacuum function reorders the data to reduce fragmentation. Once the vacuum is complete, the Disk Safe is populated with all Recovery and Archive points. This important maintenance can significantly improve restore performance and shrink the on-disk size of the Disk Safe.

For more about Disk Safe vacuum, see Vacuum Disk Safes.

If you are a service provider using the Server Backup Manager as a multi-tenant system, use Volume quotas based on Size of Deltas in Disk Safe instead of On Disk Size. This way, your customers pay only for what is stored in the Disk Safe rather than for the on-disk footprint, which includes the unused parts of the Disk Safe file being recycled for future deltas.

Corruption of data archived in the Disk Safe

The Server Backup Manager Disk Safe is highly reliable and robust. Even with industrial grade protection, there are still ways for your data to become lost or damaged beyond repair. If any of the following events occur, you may corrupt your Disk Safe. If your Disk Safe becomes corrupted, you may lose all or some of your archived data.

If the SBM Disk Safe is corrupted by any one of the events below, it cannot be repaired.

Deleting a file in the Disk Safe folder.
Making an incomplete copy of the Disk Safe folder, thereby corrupting the copy.
Making a copy of the Disk Safe files while the Disk Safe is open and being written to by Server Backup Manager, thereby corrupting the copy.
A hardware or O/S fault causing incorrect data to be written to Disk Safe files.
A faulty hard disk or storage controller failing to flush its volatile cache when requested, which can break protection from unclean shutdowns and power failures.
A rogue process writing to the Disk Safe files.
Soft linking any files inside the Disk Safe folder. If the block deltas store and its associated write journal (created at run time) end up on different file systems, data loss can occur if there is a crash or power failure.
Failure to store the Disk Safe on a journaling file system (XFS/Ext4), which can cause the write journal to be lost or moved to lost+found. If this happens, the Disk Safe will likely become damaged beyond repair.
NFS (Linux / Unix Network File System) faults or bugs.
If using NFS on Linux, R1Soft recommends you use the latest available NFS versions and latest stable Linux kernels. Do not export NFS file systems with asynchronous writes (async option).
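One of the risks listed above, soft links inside the Disk Safe folder, is easy to check for. This is a small sketch; the function name and the example path are hypothetical.

```shell
#!/bin/sh
# List symbolic links anywhere under a Disk Safe folder.
# Any output means part of the Disk Safe may live on another file system.
find_soft_links() {
    find "$1" -type l
}

# Example (path is hypothetical): find_soft_links /disk/safes
```

An empty result means no soft links were found under that folder. It says nothing about the other risks in the list.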

Data corruption in your environment

If your file system was corrupt at the time you replicated it, this does not mean your Disk Safe is corrupt. Instead, the corruption was on your primary storage when Server Backup Manager replicated it to the Disk Safe. If this happens, the corrupt files will likely be unrecoverable.
Use caution if you are replicating a server that has file system or disk subsystem warnings or errors in the event or system logs. Be aware it is possible for a working server to have file system corruption. Key file system data structures may be damaged on disk and loaded in memory by the operating system, causing a corrupt file system to work until it is rebooted. If this happens, your files were corrupt when replicated and you may not be able to restore them from the Server Backup Manager replica.
It is possible for hardware or O/S faults to cause incorrect data to be read from the Disk Safe at the time of restore, even though the data on the disk media is correct. In these cases, some or all of the archived data may appear damaged beyond repair until the fault is corrected. Determining whether the fault occurs only on read or whether the data is damaged on the media may be difficult or impossible. Furthermore, no amount of checking or validating data in the Disk Safe can prevent these kinds of faults or warn of them in advance.
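Before replicating a server, its kernel log can be scanned for the kind of file system or disk subsystem warnings mentioned above. The sketch below is illustrative only: the function name is hypothetical and the patterns are examples, not a complete list of the messages a failing system can produce.

```shell
#!/bin/sh
# Filter kernel log text on stdin for lines that hint at storage trouble.
# Assumption: the patterns are illustrative examples, not a complete list.
storage_warnings() {
    grep -Ei 'I/O error|EXT4-fs error|XFS.*corruption|medium error'
}

# Example: dmesg | storage_warnings
```

A clean result does not prove the file system is healthy, since in-memory structures can mask on-disk damage until a reboot.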