fbpx

Magento

Huge Magento database and cleaning it

by , , revisited on


We have by far the largest RPM repository with NGINX module packages and VMODs for Varnish. If you want to install NGINX, Varnish, and lots of useful performance/security software with smooth yum upgrades for production use, this is the repository for you.
Active subscription is required.

You may be wondering why your Magento database is so big. It is quite common to have huge Magento database exceeding 1 GB in size. But when its size grows near to 5 GB, you have to start asking your self questions. Does my store has so many products in catalog or there’s something wrong with the database? Let’s find out.

First, let’s find out which Magento database tables are taking most of the space. Connect to your favorite MySQL client (worst case, PhpMyAdmin 🙂 ) and run queries below:

Find largest tables in all databases

SELECT table_schema as "Database", table_name AS "Table", 
ROUND(((data_length + index_length) / 1024 / 1024), 2) AS "Size (MB)" 
FROM information_schema.TABLES 
WHERE (data_length + index_length) > 1000000 
ORDER BY table_schema, (data_length + index_length) DESC;

This lists tables from all databases, which exceed 1 MB, sorted by size.

Find the largest tables in a specific database

If you’re interested in a specific database only, you can run:

SELECT table_name AS "Table",
ROUND(((data_length + index_length) / 1024 / 1024), 2) AS "Size (MB)"
FROM information_schema.TABLES
WHERE (data_length + index_length) > 1000000  AND table_schema = "magento" 
ORDER BY (data_length + index_length) DESC; 

… where magento is your Magento database name,

You will see a sorted list like the following:

sales_flat_quote_address    2761.75
sales_flat_quote    2400.58
report_viewed_product_index 566.08
sales_flat_order_item   457.58
customer_address_entity_varchar 201.39
customer_entity_varchar 190.44
customer_entity_int 165.33
sales_flat_quote_item_option    116.14

Solutions below are for Magento 1.x only

Huge catalogsearch_query table

This table stores frequent searches on your website. Truncating is most safe to be done with the following query:

DELETE FROM  `catalogsearch_query` WHERE  `synonym_for` IS NULL AND  `redirect` IS NULL` 

This leaves out manually added synonyms and redirects.

Huge core_cache table

My observation is similar to the following:

Amasty Full Page Cache uses its own Table called core_cache. This database will bloat as visitors visit the website. It is a big flaw with the software. Our database increased from 300Mb up to 8.5GB which resulted in slow database transactions (i.e. add to cart, login, etc), and we have a website with approx 1,000 products.

So it’s best to remove it and use

Huge sales_flat_quote and sales_flat_quote_address tables

As you can see from the output, sales_flat_quote_address and sales_flat_quote each taking up whopping 2+ GB, totalling 5 GB.
Mostly those tables contain abandoned cart data. Now, do we really need to waste 5 GB of space for that?

And worse, if you’re performance oriented like myself, you know that you have to allocate MySQL InnoDB buffer pool that matches your data in size. So do we also need to waste 5 GB RAM for abandoned cart data buffers?

Of course, no. So let’s reduce the size, save some space and gain some performance.

Quick fix

Now there’s a time for simple one time solution:

DELETE FROM sales_flat_quote WHERE updated_at < DATE_SUB(Now(),INTERVAL 60 DAY);

This will remove data for quotes which were not converted within 60 days. No need to run extra query for sales_flat_quote_address. Since Magento uses databases triggers in its database schemata, the related tables data will be automatically removed from dependent tables.

Long term solution #1

Query fixed the problem for some time. But people are still visiting your store and some of them will surely enough abandon the website. The problem persists. So a better approach would be to regularly cleanup those tables via cron. A plugin comes to the rescue. Meet Aoe Quote Cleaner.

Long term solution #2

A better approach as it does not require use of yet another plugin, is leverage MySQL event schedule.

First you have to enable it in my.cnf:

[mysqld]
event_scheduler=ON

Do not restart MySQL to apply this change as you can apply it dynamically via:

SET GLOBAL event_scheduler = ON;

Then create an actual schedule to run daily cleanup of records which are older than 60 days:

CREATE EVENT IF NOT EXISTS clean_sales_flat_quote 
ON SCHEDULE EVERY 1 DAY
DO
  DELETE FROM sales_flat_quote 
  WHERE updated_at < DATE_SUB(Now(),INTERVAL 60 DAY);

Huge dataflow_batch_export table

This table is populated with entries when running a data export. If you running exports programmatically, e.g. via cron, make sure to add up cleaning routing to your script.

A sample script that does it can be found in a Gist.

If you are not running an export job right now, but still the table is big, truncating it is safe.

Huge core_session table (core_session.ibd)

This table is used when you store Magento sessions in MySQL backend. This is the worst place you can store them.
First, it’s only needed for multi-server setup. And second, for multi-server setup you better off using Redis.

To clean up the huge core_session table after switching to a different backend, you can:

n98-magerun db:query "TRUNCATE TABLE core_session"

Huge cron_schedule table

This table rarely gets large, but it happens that after disabling of a plugin, its cron tasks are still being scheduled and never cleaned up. Remove the plugin completely and cleanup all but recent scheduled cron tasks:

DELETE FROM cron_schedule WHERE scheduled_at < NOW() - INTERVAL 2 DAY;

Huge report_viewed_product_index table

Do you really need that data?

This table stores data about viewed products: who viewed which product. This is useful for showing blocks like “Recently viewed products”. Another use case is gathering stats about most viewed products. If you don’t have these use cases and not planning to implement them, it is safe to truncate the table daily.

In that way, you will not waste performance for useless data.

While you could have set a daily cron running a TRUNCATE query against the table in question, there’s a better and more elegant approach.

You can make use of Blackhole store engine. It acts like /dev/null for database writes. So all the unnecessary I/O and disk space will be saved:

ALTER TABLE report_viewed_product_index ENGINE=blackhole;

You can also disable log events (untested).

But if you need that data…

However, if you do need that data, there are some more interesting insights about it. Knowing these will allow you to weed out the large chunks of unneeded records:

Typically the report_viewed_product_index records would have:

  • visitor_id = NULL when it is a customer, and customer_id is not NULL.
  • customer_id = NULL when it is not a logged in customer, and visitor_id is not NULL.

The case where both are NULL is not an easy one.

The vistor_id is retrieved from the Mage_Log_Model_Visitor class.

The only possibility is that it is a visitor that has an excluded user agent, which means that is a bot (Google Bot, etc.).

As you can see from the constructor, the request logging is skipped in that case:

protected function _construct()
{
    $this->_init('log/visitor');
    $userAgent = $this->_httpHelper->getHttpUserAgent();
    $ignoreAgents = $this->_config->getNode('global/ignore_user_agents');
    if ($ignoreAgents) {
        $ignoreAgents = $ignoreAgents->asArray();
        if (in_array($userAgent, $ignoreAgents)) {
            $this->_skipRequestLogging = true;
        }
    }
    if ($this->_logCondition->isLogDisabled()) {
        $this->_skipRequestLogging = true;
    }
}

The ignored user agents are defined under app/code/core/Mage/Log/etc/config.xml and are the following:

  • Googlebot/1.0 (googlebot@googlebot.com http://googlebot.com/)
  • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • Googlebot/2.1 (+http://www.googlebot.com/bot.html)

Finally, the case where both are not NULL

It happens when:

  • logged out customer visits a product page
  • leaves the website
  • comes back and logs in as a customer
  • revisits the same product page

In that case, Magento will keep both values.

The visitor data is later set to NULL when the customer logs out with the following code:

public function purgeVisitorByCustomer(Mage_Reports_Model_Product_Index_Abstract $object)
{
    /**
     * Do nothing if customer not logged in
     */
    if (!$object->getCustomerId()) {
        return $this;
    }

    $bind   = array('visitor_id'      => null);
    $where  = array('customer_id = ?' => (int)$object->getCustomerId());
    $this->_getWriteAdapter()->update($this->getMainTable(), $bind, $where);

    return $this;
}

What about the daily cron that does log cleaning? Inside app/code/core/Mage/Reports/Model/Resource/Product/Index/Abstract.php you can locate the clean function:

/**
 * Clean index (visitor)
 *
 * @return Mage_Reports_Model_Resource_Product_Index_Abstract
 */
public function clean()
{
    while (true) {
        $select = $this->_getReadAdapter()->select()
            ->from(array('main_table' => $this->getMainTable()), array($this->getIdFieldName()))
            ->joinLeft(
                array('visitor_table' => $this->getTable('log/visitor')),
                'main_table.visitor_id = visitor_table.visitor_id',
                array())
            ->where('main_table.visitor_id > ?', 0)
            ->where('visitor_table.visitor_id IS NULL')
            ->limit(100);
        $indexIds = $this->_getReadAdapter()->fetchCol($select);

        if (!$indexIds) {
            break;
        }

        $this->_getWriteAdapter()->delete(
            $this->getMainTable(),
            $this->_getWriteAdapter()->quoteInto($this->getIdFieldName() . ' IN(?)', $indexIds)
        );
    }
    return $this;
}

It:

  • Cleans only those records in report_viewed_product_index where visitor_id isn’t null and the corresponding entry in log_visitor no longer exists
  • Does not clean any records where visitor_id is NULL (presumably this data is useful because it has to contain positive customer_id). This gives you historical data about what your customers have been looking at before or after their purchase.
  • It fails to clean the records where both visitor_id and customer_id are NULL. This is what we will take care about, because this is truly a bug that Magento won’t clean this.

Even if you had setup Magento 1.9 cron properly and enabled log cleaning in Magento settings, the table report_viewed_product_index is not cleaned properly. Let’s take care of this:

@daily /usr/bin/n98-magerun db:query "DELETE FROM report_viewed_product_index  WHERE visitor_id IS NULL and customer_id IS NULL" > /dev/null

Finally, if you want to go hardcore and trim and truncate all you want, read this article about all the tables you can truncate without affecting your store function.

To get information about the sizes of log tables in Magento, use:

php -f shell/log.php status

Huge log_visitor_info table

The table log_visitor_info may grow too large especially if the Magento setting Save Log, Days is set too high, resulting in too many entries/data.

This table can also grow as a result of a DoS attack.

Ensure that the setting Save Log, Days (Configuration -> System) is set to no more than 90 days.

Subsequently, apply cleaning for reducing existing data after changing the setting’s value:

n98-magerun sys:cron:run log_clean

Magento is by far the most widely used Ecommerce self-hosted platform. But it needs very thorough approach when it comes to performance. One important factor that affects Magento’s performance is its database size. Be sure to check our full ultimate Magento performance checklist for all the tips and tricks on getting Magento to run as fast as possible.

Or save yourself lots of time and let me configure and setup Magento server for you.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

This site uses Akismet to reduce spam. Learn how your comment data is processed.