crawlers » AETHYX MEDIAE – Independent Semantic Online Publishing

Published 28 October 2024 by the aethyx staff

Dear readers and customers,

Starting this month, we have added a new feature to all our web projects: we are blocking all AI crawlers. Or at least we try to.

There is no guarantee that it works or that the crawlers will respect our manually integrated rules. However, we have tried our best.

We do this for two main reasons:

1) You should be in control of your posts and thus your data. If your individual posts are reused or altered, you should know beforehand. This is currently not the case with the way machine learning tools are trained: they simply use whatever they can find on the open web.

2) If your work helps individual companies make money in any way, you should get your share. Our idea is to be fair: 50/50. For every euro earned with your hard work, you should get at least 50 cents.

Here is the list of crawlers we currently try to block:

AI2Bot	Explores sites for web content that is used to train open language models.
AmazonBot	Used by Amazon’s Alexa AI to provide AI answers.
AppleBot	Used by Apple for generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.
Bytespider	Used by TikTok for AI training.
Cohere	Used by Cohere to scrape data for AI training.
ChatGPT	Used by OpenAI to power ChatGPT.
ClaudeBot and Claude-Web	Used by Anthropic’s Claude.
CommonCrawl	Compiles datasets used to train AI models.
Diffbot	Used by Diffbot to scrape data for AI training.
FacebookBot	Used by Meta (Facebook) for their AI.
Friendly Crawler	Crawls websites to build datasets for machine learning experiments.
Google Extended	Used by Google to power Gemini (formerly known as Bard).
ImagesiftBot	Used by Hive’s Imagesift tool that scrapes images. This may be used for the company’s generative AI product.
Kangaroo Bot	Used to power the Australia-focused Kangaroo LLM.
Meta-ExternalAgent / Meta-ExternalFetcher	Used by Meta (Facebook) to train AI products.
OAI-SearchBot	Used by OpenAI for their SearchGPT product.
Omgilibot	Used by Omgili to scrape data for AI training.
PerplexityBot	Used by Perplexity for their AI products.
Scrapy	Blocks the Scrapy bot (used for scraping websites).
SentiBot	Blocks SentiOne’s AI-powered social media listening and analysis tools.
Timpibot	Used by Timpi; likely for their Wilson AI Product.
Webzio	Used by Webz.io for their social listening and intelligence platforms.
Webzio-Extended	Used by Webz.io for AI training.
YouBot	Used by You.com to train AI products.
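
For readers who want to try something similar on their own site, here is a minimal sketch of how such blocking rules could look. It assumes the rules are written as robots.txt directives, which is only one possible way to do it and not necessarily how our own setup works. The user-agent tokens are taken from the names in the list above and are assumptions; the exact token each crawler announces should be checked against that crawler's own documentation. The helper names and the example.org URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumption: these tokens mirror names from the list above. The token a
# crawler really sends may differ, so verify each one before relying on it.
AI_CRAWLER_TOKENS = [
    "AI2Bot",
    "Bytespider",
    "ClaudeBot",
    "Claude-Web",
    "Diffbot",
    "ImagesiftBot",
    "OAI-SearchBot",
    "Omgilibot",
    "PerplexityBot",
    "Timpibot",
    "YouBot",
]


def build_robots_txt(tokens):
    """Return robots.txt text that disallows the whole site for each token."""
    blocks = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    return "\n\n".join(blocks) + "\n"


def is_blocked(robots_txt, user_agent, url="https://example.org/some-post/"):
    """Ask the standard-library parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


robots_txt = build_robots_txt(AI_CRAWLER_TOKENS)

if __name__ == "__main__":
    print(robots_txt)
    print("ClaudeBot blocked:", is_blocked(robots_txt, "ClaudeBot"))
    print("Googlebot blocked:", is_blocked(robots_txt, "Googlebot"))
```

Rules like these are only a request. A crawler that ignores robots.txt is not stopped by them, which is why we say that we try to block these crawlers.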

If you are already a customer (thank you!), we have activated the feature automatically on your website for free. There is no additional cost and there never will be.

If you want to join as a new happy customer, the feature is added automatically when we set up your site. The information is already up to date on the product overview page: https://aethyx.eu/eshop/.

Sorry for the inconvenience. In an ideal world this would never have been necessary. However, the world is far from ideal at the moment. Let’s look to the future, as things can only get better from here.

Enjoy the fall and best wishes,
the aethyx staff

Posted in AI/ML, News. Tags: AI, blocking, crawlers