We got a great question from a potential customer today. The essence of the question was:
[…] does Unitrends have the ability to “locate” the deduplicated files/paths as it runs so that a histogram or pie charts (or some other identifying method) can be found to identify and change the backup schedule for areas that obviously don’t need to be backed up often? Sort of makes sense, right? Why spin the CPU on the appliance backing up data that you already know is redundant? Saves time on backups, and helps users migrate off old data, or at least change the backup schedule so that it backs up once a month (or quarter) vs. once a week. […] Let me know if such a thing exists in general, and how Unitrends has (or will) address something that I would think is a really good selling point for the particular dedup software involved.
This is increasingly an issue as data gets colder (I previously touched on the subject of data getting colder here).
The “deduplication industry” doesn’t want to talk about this type of thing in general – because the answer to everything is “deduplication.” It’s like walking around with a hammer – everything starts to look like a nail. So the obvious solution if you’re in the “deduplication business” is to look at this and start talking about stuff like “source-level deduplication” – so that you don’t back up files or blocks that have already been backed up. The trouble is that source-level deduplication eats CPU cycles on the computers and storage being protected: they have to compute comparison indices just to decide not to back up the non-changing data.
What you want isn’t a deduplication solution; instead, you want your backup to be smart enough to avoid redundant backups of non-changing (cold) data. That’s where incremental forever comes into play. After the first master, you simply don’t back up data that doesn’t change. Period. No deduplication is involved – and better yet, no user adjustment or even awareness of backup schedules is involved – the system simply backs up only the data that changes.
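To make the idea concrete, here is a minimal sketch of incremental-forever selection in Python. It is an illustration only, not how any particular product (Unitrends included) implements it: the function names and the catalog format are hypothetical, and it assumes a file counts as changed when its size or modification time differs from what the last backup recorded.

```python
import os


def snapshot(root):
    """Map each file path (relative to root) to its (size, mtime_ns) signature."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            state[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return state


def changed_since(catalog, current):
    """Return paths that are new, or whose signature differs from the catalog."""
    return sorted(path for path, sig in current.items() if catalog.get(path) != sig)


def run_backup(root, catalog):
    """Return (paths to back up, updated catalog).

    An empty catalog produces the first master: every file is selected.
    Later runs select only files whose size or mtime has changed.
    (Hypothetical helper; a real product would also handle deletions,
    permissions, open files, and so on.)
    """
    current = snapshot(root)
    return changed_since(catalog, current), current
```

Called with an empty catalog, `run_backup` selects everything, which is the first master. Feed the returned catalog into the next run and cold files drop out of the selection on their own, without reading or hashing their contents and without anyone touching the schedule.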
This is one of the reasons that the concept of an integrated all-in-one backup appliance is so superior to the functionally partitioned combination of backup server, backup software, and deduplication device used in much of the industry today. By integrating the backup architecture with deduplication, you avoid forcing the user to make decisions about backup strategy based on how data changes over time – instead, your backup appliance simply handles that in the most efficient way possible.
Now