I guess I overestimated DFS Replication. I was migrating a huge production share from a single drive on a PC over to the two servers. In the new setup, each server had a 1 TB Western Digital Black hard drive dedicated to this folder of production data, and DFS Replication would keep it duplicated and synced between the two servers.
Note: Instead of a NAS or just a mirrored array in one server, I wanted full hardware redundancy, and I wanted it abstracted through the DFS namespace so the users didn’t have to worry about it. This way, one of the servers could catch fire and melt into a puddle on the floor, and the users could still browse to the network share and access files. A single NAS or mirrored array is still one entity.
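For reference, the namespace side of that looks roughly like this when scripted with the DFSN PowerShell module (which ships with Server 2012 and later; the same thing is a few clicks in the DFS Management console). The namespace, server, and share names here are made up:

```powershell
# Create a namespace folder whose first target is SERVER1's share,
# then add SERVER2's share as a second folder target.
# Users only ever see \\contoso.local\files\Production.
New-DfsnFolder       -Path "\\contoso.local\files\Production" -TargetPath "\\SERVER1\Production"
New-DfsnFolderTarget -Path "\\contoso.local\files\Production" -TargetPath "\\SERVER2\Production"
```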
I waited for everyone to leave, and I did a straight copy from the old server to the new ‘server array’. This folder housed about 750 GB spread over more than 300,000 files. Everything seemed OK during the transfer, but when it was done, I had a huge mess. File Services was throwing errors and warnings, LAN bandwidth was suffering, and thanks to my lack of planning, the nightly backup kicked off, sweeping all of the still-trying-to-replicate files up with it. ( The nightly backup is managed by the Windows Server Backup snap-in, and it goes to an external ioSafe )
System performance was abysmal, and I resorted to rebooting the two servers ( I can’t remember if I told them to stop replicating that folder at this point or not ). Unknown to me, both of their network adapters came back from the reboot hosed, so communication between the two DCs was slow, users coming in early could barely log into the domain, and the DNS servers were AWOL. All of this, seemingly, because I flooded a replication folder without any plan of attack.
Solution:
Before anything else, I disabled / enabled the network adapters, and that fixed whatever happened to them. ( I know, wtf? Both servers had this same problem )
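If you ever need to do the same, the bounce can be scripted. netsh works on any vintage of Windows Server; the NetAdapter cmdlets need 2012 or later. The adapter names are whatever yours happen to be called:

```powershell
# Older servers: disable and re-enable the adapter with netsh.
netsh interface set interface name="Local Area Connection" admin=DISABLED
netsh interface set interface name="Local Area Connection" admin=ENABLED

# Server 2012+: the NetAdapter module does the cycle in one line.
Restart-NetAdapter -Name "Ethernet"
```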
I scrapped the replication group for this production folder, I scrapped the physical files on the secondary server that had already replicated, I scrapped the namespace targets pointing to the folder, and started over. ( During this time, I redirected the users back to the old “file server” )
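The teardown, roughly, if scripted instead of clicked through DFS Management (the DFSR cmdlets require Server 2012 R2 or later; group names and paths are placeholders):

```powershell
# Delete the oversized replication group along with its replicated-folder config.
Remove-DfsReplicationGroup -GroupName "Production" -RemoveReplicatedFolders -Force

# Pull the namespace folder target that pointed at the secondary server.
Remove-DfsnFolderTarget -Path "\\contoso.local\files\Production" -TargetPath "\\SERVER2\Production"

# Then delete the partially replicated files on the secondary server by hand.
Remove-Item "D:\Production" -Recurse -Force
```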
New plan of attack:
- Make a separate replication group for each ‘project’ in the large folder. This will result in about 20 groups.
- Make a namespace folder that will house the 20 replication groups.
- Publish each replication group into the aforementioned namespace folder. ( A scripted sketch of these steps follows below )
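Scripted, the whole plan is something like this sketch (DFSR/DFSN modules again, so Server 2012 R2+; server names, paths, shares, and the project list are all hypothetical, and each project folder is assumed to already be shared on both servers):

```powershell
# One replication group per project, each published under the same namespace folder.
$projects = Get-ChildItem "D:\Production" -Directory | Select-Object -ExpandProperty Name

foreach ($p in $projects) {
    # Replication group + replicated folder for this one project.
    New-DfsReplicationGroup -GroupName "Prod-$p"
    New-DfsReplicatedFolder -GroupName "Prod-$p" -FolderName $p
    Add-DfsrMember     -GroupName "Prod-$p" -ComputerName "SERVER1","SERVER2"
    Add-DfsrConnection -GroupName "Prod-$p" -SourceComputerName "SERVER1" -DestinationComputerName "SERVER2"

    # SERVER1 holds the authoritative copy for initial replication.
    Set-DfsrMembership -GroupName "Prod-$p" -FolderName $p -ComputerName "SERVER1" `
        -ContentPath "D:\Production\$p" -PrimaryMember $true -Force
    Set-DfsrMembership -GroupName "Prod-$p" -FolderName $p -ComputerName "SERVER2" `
        -ContentPath "D:\Production\$p" -Force

    # Publish the project into the namespace so users still see one tree.
    New-DfsnFolder       -Path "\\contoso.local\files\Production\$p" -TargetPath "\\SERVER1\$p"
    New-DfsnFolderTarget -Path "\\contoso.local\files\Production\$p" -TargetPath "\\SERVER2\$p"
}
```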
Server performance went back to happy levels. Each group began to replicate, and after 30 minutes or so, the secondary server had completed replication of all the groups.
This solution comes with some benefits:
- Now that each ‘project’ (around 50 GB each) is its own replication group, and the target of its own namespace location, I can spread the ‘projects’ across multiple hard drives in each server.
- File Services warnings now have a little more meaning because they point to specific groups, and not just one monolithic 750 GB replication group. I can adjust staging quotas and such appropriately. ( See the sketch below )
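For example, giving one group’s staging area more room is now a targeted change instead of a global one (same hypothetical names as above; “CAD” is a made-up project):

```powershell
# Bump the hypothetical "CAD" project to a 16 GB staging quota on both members.
foreach ($srv in "SERVER1","SERVER2") {
    Set-DfsrMembership -GroupName "Prod-CAD" -FolderName "CAD" -ComputerName $srv `
        -StagingPathQuotaInMB 16384 -Force
}
```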