Download Tips. Duplicate Files Elimination

Once you have searched the web for, say, "white tigers" or "white tigers pictures", you get the addresses of web sites that contain (although this is not 100% guaranteed!) information about white tigers. The next step is to download a number of files and web pages from the sites found and see what they are really about. Just specify the directory for saved files and choose the Web | Download menu item.

"Favorites" and "Bad Files" Folders

It can take a considerable amount of time to download thousands of web sites, even if you turn the partial download options on and enable the smart download feature. To speed up processing, create a folder for unwanted (bad) files. As the download proceeds, view the results in Image Viewer and move large unwanted files to the "bad files" folder. Small files, banners, logos, etc. can simply be deleted without placing them in the "bad files" folder. Then open the Tools | Address Book dialog to block unwanted sites "on the fly".
On the other hand, when you find an interesting picture, you can mark its source URL as a "favorite" so that the site is downloaded separately from the others (and before them).
Note: To enable file processing based on source URL, make sure that the "Attach URL Markers to Files" box on the Download tab is checked.

Blocking Sites with Unwanted Content

Some web sites have unique descriptions and URL addresses and pretend to be totally different web sites. One site can have hundreds of different names (URLs), and each page has its own description, but the content is identical on all such pages! As a result, you get a large number of totally identical files stored on your hard drive! Fortunately, we can effectively get rid of such sites, and of any clones that appear in the future, by blocking their IP addresses. Spammers can use hundreds of host names, but usually all of them point to a single domain hosted on a computer with just one IP address. If we block that IP address, we also block all web sites hosted there. Even if such sites have already been added to the project database, they are not processed.

To block annoying sites, move all duplicate pictures to a separate folder (let's call it the "bad files" folder) and open the Tools | Address Book dialog. Then browse for the "bad files" folder, set the tolerance level (the default value of 3 is optimal in most cases) and press the Block IPs button. It is simple, but use this trick only when you feel it is absolutely necessary. Today a great many web sites share IP addresses (the IP address is the same; only the site names differ), and this may cause problems. Just imagine that a spammer's web pages are hosted by a popular web hosting provider that also serves a large number of normal sites: if you block this IP address, ALL web sites hosted by this provider at that address will become unavailable to you! Most often, however, spammer sites are hosted on their own machines and can be blocked effectively. If you are not sure, try increasing the tolerance level. For example, setting the tolerance level to 5 means "if 5 or more bad web sites have the same IP address, this IP address is blocked".
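The tolerance rule quoted above can be illustrated with a short Python sketch. This is not the program's actual code; the function name `ips_to_block`, the host names and the IP addresses are all hypothetical and chosen only to show how the threshold behaves.

```python
from collections import defaultdict

def ips_to_block(bad_site_ips, tolerance=3):
    """Return IP addresses shared by at least `tolerance` distinct bad sites.

    `bad_site_ips` maps a host name to its IP address (hypothetical input;
    the real program derives this from the files in the "bad files" folder).
    """
    sites_per_ip = defaultdict(set)
    for host, ip in bad_site_ips.items():
        sites_per_ip[ip].add(host)
    # "if N or more bad web sites have the same IP address, it is blocked"
    return {ip for ip, hosts in sites_per_ip.items() if len(hosts) >= tolerance}

# Made-up example data: three bad sites on one address, one on another.
bad_site_ips = {
    "white-tigers-1.example": "203.0.113.7",
    "white-tigers-2.example": "203.0.113.7",
    "big-cats-gallery.example": "203.0.113.7",
    "shared-host-page.example": "198.51.100.20",
}

print(ips_to_block(bad_site_ips, tolerance=3))  # {'203.0.113.7'}
print(ips_to_block(bad_site_ips, tolerance=5))  # set()
```

With the default level of 3 only the address serving three bad sites is blocked; raising the level to 5 blocks nothing, which is the safer choice when the bad site may sit on a shared hosting address.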

Sometimes duplicate files are saved from sites that do not look like mirrors but have identical content. There are two typical cases:

a) Spammer sites:
  • content is po**ographic, image size is usually small
  • you get a number of identical files, each duplicated 5-50 times or more
To block such sites, follow the procedure described above.
b) Sites that are simply identical but have different descriptions and different URL addresses:
  • content can be anything, but is rarely po**ographic
  • you get several identical files, each duplicated 2-4 times, rarely more
You can just ignore such sites and delete the duplicate pictures, or you can move all duplicate pictures to the "bad files" folder, then open the Address Book dialog, set the tolerance level to 2 or more and press the Block URL button.
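
If you want to check outside the program how many times each picture was duplicated, a small script can group files by content. The following Python sketch is only an illustration under stated assumptions: it treats two files as duplicates when their bytes are identical (compared via SHA-256), which may differ from how the program itself detects duplicates, and the function name `find_duplicates` is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(folder):
    """Group files under `folder` by content hash.

    Returns a list of groups; each group is a list of paths whose
    contents are byte-for-byte identical (two or more files per group).
    """
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            # Reads each file fully; fine for pictures, slow for huge files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_duplicates` on the download directory returns one group per duplicated picture. Groups with 5 or more copies suggest case a), while groups of 2-4 copies suggest case b).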
