MultiSite Robots

Sybre Waaijer on June 17, 2015

Where to get started?

I didn’t really know where to start with my first Code Snippet. But I guess this is as good an example as any other.

I’ll elaborate: I’ve been working on a MultiSite network for over a year now and I’ve encountered a lot of awesome stuff you can do with WordPress. And so I did.

This is followed by a problem: when I make a plugin for my own MultiSite, I don’t usually think about “how are users going to use this plugin?”, because I use it. No one else.
The users of my MultiSite might catch on to it, but most of the time they don’t interact with the settings I make through these plugins.

So there we have it. Although I have plugins like AutoDescription and Pro Mime Types, which both have user interfaces, most of my plugins don’t have interfaces.

So this is where I leave them.

The plugin

So you want Google and Bing to notice you have a sitemap? But you don’t have time to set up webmaster tools? Of course you don’t. Not as a user on my site.
So I’ll solve it for you. With robots!

WordPress has a nice little filter named robots_txt that manipulates the robots.txt output. We just have to return a value and we’ve created a whole new robots.txt file.

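But this won’t fire if you have a physical robots.txt file in the root of your website, because the web server then usually serves that file directly and WordPress never gets to run… so go ahead and delete it! (Make a backup, just in case ;).)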
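To make the filter idea concrete before the full plugin, here is a minimal sketch. The function name my_example_robots is made up for this illustration and is not part of the plugin. It assumes the callback receives the generated output plus the site's visibility flag, and it only registers itself when WordPress is actually loaded.

```php
<?php
/**
 * Hypothetical example callback: appends one rule to the generated robots.txt.
 *
 * @param string $output The robots.txt content WordPress has built so far.
 * @param mixed  $public Truthy when the site is visible to search engines.
 * @return string The modified robots.txt content.
 */
function my_example_robots( $output, $public ) {
	// Private sites already disallow everything, so only extend public ones.
	if ( $public ) {
		$output .= "Disallow: /wp-login.php\n";
	}

	return $output;
}

// Register only when WordPress is loaded.
// 10 is the priority, 2 is the number of arguments the callback accepts.
if ( function_exists( 'add_filter' ) ) {
	add_filter( 'robots_txt', 'my_example_robots', 10, 2 );
}
```

The last argument of add_filter matters here: a callback that takes two parameters has to ask for two, otherwise WordPress only passes the first one.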

The code

<?php
/**
 * Plugin Name: MS Robots
 * Plugin URI: https://hostmijnpagina.nl/
 * Description: Automatically edits robots.txt for each blog. Includes sitemaps url, sitemaps need to be generated (AFAIK) by other plugins. Caches the output.
 * Version: 1.0.0
 * Author: Sybre Waaijer
 * Author URI: https://cyberwire.nl
 * License: GPLv2 or later
 */

/**
 * Filters the robots_txt function of WP Core
 *
 * @since 1.0.0
 */
function hmpl_msrobots($output, $public) {
	
	$blog_id = get_current_blog_id();
	
	$output = wp_cache_get('msrobots_' . $blog_id, 'msrobots' );
	if ( false === $output ) {
		$site_url = parse_url( site_url() );
		$path = ( !empty( $site_url['path'] ) ) ? $site_url['path'] : '';
		
		$output = "User-agent: *\n";
		
		//* If the blog isn't public, disallow everything.
		$public = get_option('blog_public');	
		if ('0' == $public) {
			$output .= "Disallow: /\n";
		} else {
			//* Output defaults
			$output .= "Disallow: $path/wp-admin/\r\n";
			$output .= "Disallow: $path/wp-includes/\r\n";
			
			//* Add our own
			$output .= "Disallow: $path/wp-login.php\r\n";
			$output .= "Disallow: $path/wp-activate.php\r\n";
			
			// Keeps crawlers away from URLs that contain a query string
			$output .= "Disallow: $path/*?*\r\n";
		}
		
		//* Add whitespace
		$output .= "\r\n";
		
		//* Add sitemap full url
		$scheme = ( !empty( $site_url['scheme'] ) ) ? $site_url['scheme'] . '://' : '';
		$host = ( !empty( $site_url['host'] ) ) ? $site_url['host'] : '';
		$output .= "Sitemap: $scheme$host/sitemap.xml\r\n";
		
		wp_cache_set('msrobots_' . $blog_id , $output, 'msrobots', 86400 ); // 24 hours
	}
	
	return $output;	
}
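// The callback takes two parameters, so we ask WordPress to pass both (priority 10, 2 arguments).
add_filter( 'robots_txt', 'hmpl_msrobots', 10, 2 );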

I think pretty much everything is commented correctly already, but let me explain from scratch.

First, we take the original output from WordPress, which is listed below.

	$output = "User-agent: *\n";
	$public = get_option( 'blog_public' );
	if ( '0' == $public ) {
		$output .= "Disallow: /\n";
	} else {
		$site_url = parse_url( site_url() );
		$path = ( !empty( $site_url['path'] ) ) ? $site_url['path'] : '';
		$output .= "Disallow: $path/wp-admin/\n";
	}

As you can see, it disallows the root path / if the blog is set to private. Otherwise it generates the site path and disallows wp-admin by default for every robot.

And that’s pretty much it. And we want more! So we add some lines for what we want to disallow.

$output = "User-agent: *\n";
		
//* If the blog isn't public, disallow everything.
$public = get_option('blog_public');	
if ('0' == $public) {
	$output .= "Disallow: /\n";
} else {
	//* Output defaults
	$output .= "Disallow: $path/wp-admin/\r\n";
	$output .= "Disallow: $path/wp-includes/\r\n";
	
	//* Add our own
	$output .= "Disallow: $path/wp-login.php\r\n";
	$output .= "Disallow: $path/wp-activate.php\r\n";
	
	// Keeps crawlers away from URLs that contain a query string
	$output .= "Disallow: $path/*?*\r\n";
}

As you might have noticed, we have also taken the $site_url variable out of the if statement and placed it before it. This is because we need it outside of the if statement as well, for the sitemap.

So, we want to tell the robots.txt visitor where the sitemap is. We do this by building a full URL from the scheme of the site URL ($scheme, which is https in my case) and the hostname ($host). With everything combined, the final result looks like this:

Sitemap: https://cyberwire.nl/sitemap.xml

And this is the code that builds it:

//* Add sitemap full url
$scheme = ( !empty( $site_url['scheme'] ) ) ? $site_url['scheme'] . '://' : '';
$host = ( !empty( $site_url['host'] ) ) ? $site_url['host'] : '';
$output .= "Sitemap: $scheme$host/sitemap.xml\r\n";
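If you want to try the sitemap part outside of WordPress, the same steps can be wrapped in a small function. This is only a sketch: example_sitemap_line is a hypothetical helper, not something from the plugin, and unlike the plugin it returns nothing at all when the URL has no hostname.

```php
<?php
/**
 * Hypothetical helper: builds the "Sitemap:" line from a full site URL.
 *
 * @param string $url The site URL, for example the value of site_url().
 * @return string The sitemap line, or an empty string if no host was found.
 */
function example_sitemap_line( $url ) {
	$parts = parse_url( $url );

	// parse_url() returns false for seriously malformed URLs.
	if ( ! is_array( $parts ) || empty( $parts['host'] ) ) {
		return '';
	}

	$scheme = ! empty( $parts['scheme'] ) ? $parts['scheme'] . '://' : '';

	// The path of the site URL is deliberately ignored, just like in the plugin.
	return "Sitemap: $scheme{$parts['host']}/sitemap.xml\n";
}
```

Calling example_sitemap_line( 'https://cyberwire.nl/blog' ) gives the same line as shown above, because only the scheme and the host are used.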

Now about wp_cache_get and wp_cache_set: I’ll elaborate on them elsewhere, but it simply goes like this. Let me comment everything below.

//* Fetch a unique value, the blog id
$blog_id = get_current_blog_id();

//* Get the cache value and store it in $output
$output = wp_cache_get('msrobots_' . $blog_id, 'msrobots' );

//* if wp_cache_get has nothing, it will return false. So we need to generate it.
if ( false === $output ) {
	//* Let's fill in our variable with stuff
	$output = 'stuff';

	//* At the end of this statement, let's store it in the cache, using the same name with unique ID: msrobots_ . $blog_id
	// Don't worry too much about the cache group, it can be left empty. The time can also be left empty.
	wp_cache_set('msrobots_' . $blog_id , $output, 'msrobots', 86400 ); // 24 hours
}

//* Finally, let's use or return the stored variable.
return $output;
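One thing to keep in mind: the cached output depends on the blog_public option, so with a persistent object cache a site could keep serving the old robots.txt for up to 24 hours after its visibility changes. (Without a persistent object cache, the WordPress object cache only lives for the current page load anyway.) Below is a possible way to clear the entry when that option is updated. Both function names are hypothetical and this is a sketch, not part of the plugin above.

```php
<?php
/**
 * Hypothetical helper: the cache key used for a given blog ID.
 *
 * @param int|string $blog_id The blog ID.
 * @return string The cache key, for example "msrobots_3".
 */
function example_msrobots_cache_key( $blog_id ) {
	return 'msrobots_' . (int) $blog_id;
}

/**
 * Hypothetical callback: forgets the cached robots.txt of the current blog.
 */
function example_msrobots_flush() {
	wp_cache_delete( example_msrobots_cache_key( get_current_blog_id() ), 'msrobots' );
}

// Register only when WordPress is loaded.
// 'update_option_blog_public' fires after the blog_public option is saved with a new value.
if ( function_exists( 'add_action' ) ) {
	add_action( 'update_option_blog_public', 'example_msrobots_flush' );
}
```

The key and the group have to match the ones used in wp_cache_set, otherwise the wrong entry (or nothing) gets deleted.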

That’s it!

Well then, I hope you’ve learned something here :) It’s my first (short) tutorial on a piece of code I have created.
If you think I can do better (or worse), please let me know! I think the comment section’s working :)
