crawl more domains

rochon.a1.119
Posts: 419
Joined: Thu Dec 26, 2024 3:23 am

crawl more domains

Post by rochon.a1.119 »

What did we do?

To optimize the queue, we added filters that prioritize unique content and high-authority websites, and protect the queue from click farms. As a result, the system is now able to find more unique content and generate reports with fewer duplicate links.

Here are some of the key elements of how it works (a rough sketch of this prioritization logic follows the list):

To protect our queue from click farms, we check whether a high number of domains are coming from the same IP address. If we see a lot of domains coming from the same IP, their priority in the queue drops. This allows us to favor content coming from different IPs.
To protect websites and avoid polluting our reports with similar links, we check to see if there are too many URLs from the same domain. If we see too many URLs coming from the same domain, they will not be crawled on the same day.
To ensure that we crawl the most recent pages as soon as possible, any URLs that we haven't crawled before will be given higher priority.
Each page has its own hash code which helps us prioritize crawling unique content.
We take into account the frequency of new links generated on the source page.
We take into account the authority score of a page and a domain.
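
The list above amounts to a scoring problem. Below is a minimal sketch in Python of what such queue prioritization could look like; every name, weight, and threshold here (CrawlCandidate, the domains-per-IP cutoff, the per-domain daily cap, and so on) is an illustrative assumption, not the actual production logic.

import hashlib
from collections import Counter
from dataclasses import dataclass

@dataclass
class CrawlCandidate:
    url: str
    domain: str
    ip: str
    page_authority: float      # assumed 0-100 authority score for the page
    domain_authority: float    # assumed 0-100 authority score for the domain
    new_links_per_day: float   # assumed rate of new links on the source page
    content_hash: str          # hash of the page body, used for deduplication

seen_hashes = set()                # hashes of content we have already crawled
domains_per_ip = Counter()         # rough count of queued domains per IP
urls_per_domain_today = Counter()  # URLs queued per domain for the current day

def page_hash(body: bytes) -> str:
    # Fingerprint of the page body, used to deprioritize duplicate content.
    return hashlib.sha256(body).hexdigest()

def priority(c: CrawlCandidate, already_crawled: bool) -> float:
    score = 0.0
    if not already_crawled:
        score += 10.0                     # URLs we have never crawled come first
    if c.content_hash in seen_hashes:
        score -= 5.0                      # duplicate content is deprioritized
    score += 0.05 * (c.page_authority + c.domain_authority)  # page and domain authority
    score += 0.5 * min(c.new_links_per_day, 10.0)            # pages that generate new links often
    if domains_per_ip[c.ip] > 100:        # many domains on one IP: click-farm signal
        score -= 20.0
    if urls_per_domain_today[c.domain] > 1000:
        score = float("-inf")             # too many URLs from one domain: defer to another day
    return score

Queued URLs would then simply be crawled in descending score order.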
How the queue has been improved

More than 10 factors now filter out unnecessary links.
More unique, high-quality pages thanks to new quality control algorithms.
Crawlers
Our crawlers follow internal and external links on the Internet looking for new pages with backlinks. Therefore, we can only find a page if there is an incoming link to it.

In reviewing our previous system, we saw an opportunity to increase overall crawlability and find better content—content that website owners would want us to crawl and index.

What did we do?

We tripled our number of crawlers (from 10 to 30)
We stopped crawling duplicate pages whose URLs differ only in parameters that do not affect the content of the page (&sessionid, UTM, etc.); see the sketch after this list
We increased the frequency with which we re-read robots.txt files on websites
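
For the parameter filtering and robots.txt items above, here is a minimal sketch using Python's standard library. The helper names, the ignored-parameter list, and fetching robots.txt on every call are illustrative assumptions, not the actual crawler code.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
from urllib.robotparser import RobotFileParser

# Parameters assumed not to change page content, so URLs differing only in
# these collapse to a single entry in the crawl queue.
IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "utm_campaign",
                  "utm_term", "utm_content"}

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in IGNORED_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))

def allowed_by_robots(url: str, user_agent: str = "SemrushBot") -> bool:
    # Re-reading robots.txt keeps crawl permissions up to date; here it is
    # fetched on every call purely for illustration.
    parts = urlsplit(url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

# Both variants collapse to https://example.com/page?id=7, so the page is queued once.
print(canonicalize("https://example.com/page?utm_source=x&id=7"))
print(canonicalize("https://example.com/page?id=7&sessionid=abc123"))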
How the crawlers have been improved

More crawlers (now 30!)
Cleaner data with no lost links or duplicates
Better discovery of the most relevant content
Crawling speed of 20 billion pages per day
Storage
Storage is where all the links you can see as a Semrush user are kept. It powers the links shown in the tool and the filters you can apply to find what you are looking for.

The main concern we had with our old storage system was that it could only be updated by rewriting it completely. That meant that every 2-3 weeks it had to be rewritten from scratch and the process started over again.

As a result, during each update, new links accumulated in temporary storage, creating a delay before users could see them in the tool. We wanted to improve the speed at this stage.

What did we do?

To improve this, we rewrote the architecture from scratch. To eliminate the need for temporary storage, we increased our number of servers by more than four times (over 400%).

It took over 30,000 engineering hours to implement the latest technologies. Now we have a scalable system with no hard limits on its growth.

How storage has been improved

More than 500 servers in total
287TB RAM
16,128 CPU cores
30PB total storage space
Filtering and reporting at lightning speed
INSTANT UPDATE - no more buffering
Backlinks Database Study
We conducted a two-part study to compare the speed of our Backlink Analysis tool with Moz, Ahrefs, and Majestic.

To see exactly how fast our tool is compared to other SEO alternatives on the market, check out this article.

We are very proud of our new Backlink Analysis database and want everyone to experience it.

Get access to a free trial by creating a Semrush account and browse the Backlink Analysis section available to you.

Try it and let us know what you think!

Welcome to the future of dynamic backlink management!