Magento

Huge Magento database and cleaning it

by , , revisited on


We have by far the largest RPM repository with dynamic stable NGINX modules and VMODs for Varnish 4.1 and 6.0 LTS. If you want to install nginx, Varnish and lots of useful modules for them, this is your one stop repository to get all performance related software.
You have to maintain an active subscription in order to be able to use the repository!

You may be wondering why your Magento database is so big. It is quite common to have huge Magento database exceeding 1 GB in size. But when its size grows near to 5 GB, you have to start asking your self questions. Does my store has so many products in catalog or there’s something wrong with the database? Let’s find out.

Find largest tables in database

First, let’s find out which Magento database tables are taking most of the space. Connect to your favorite MySQL client (worst case, PhpMyAdmin 🙂 ) and run:

SELECT table_name AS "Table",
ROUND(((data_length + index_length) / 1024 / 1024), 2) AS "Size (MB)"
FROM information_schema.TABLES
WHERE table_schema = "magento"
ORDER BY (data_length + index_length) DESC; 

… where magento is your Magento database name,

You will see a sorted list like the following:

sales_flat_quote_address    2761.75
sales_flat_quote    2400.58
report_viewed_product_index 566.08
sales_flat_order_item   457.58
customer_address_entity_varchar 201.39
customer_entity_varchar 190.44
customer_entity_int 165.33
sales_flat_quote_item_option    116.14

Solutions below are for Magento 1.x only

Huge core_cache table

My observation is similar to the following:

Amasty Full Page Cache uses its own Table called core_cache. This database will bloat as visitors visit the website. It is a big flaw with the software. Our database increased from 300Mb up to 8.5GB which resulted in slow database transactions (i.e. add to cart, login, etc), and we have a website with approx 1,000 products.

So it’s best to remove it and use

Huge sales_flat_quote and sales_flat_quote_address tables

As you can see from the output, sales_flat_quote_address and sales_flat_quote each taking up whopping 2+ GB, totalling 5 GB.
Mostly those tables contain abandoned cart data. Now, do we really need to waste 5 GB of space for that?

And worse, if you’re performance oriented like myself, you know that you have to allocate MySQL InnoDB buffer pool that matches your data in size. So do we also need to waste 5 GB RAM for abandoned cart data buffers?

Of course, no. So let’s reduce the size, save some space and gain some performance.

Quick fix

Now there’s a time for simple one time solution:

DELETE FROM sales_flat_quote WHERE updated_at < DATE_SUB(Now(),INTERVAL 60 DAY);

This will remove data for quotes which were not converted within 60 days. No need to run extra query for sales_flat_quote_address. Since Magento uses databases triggers in its database schemata, the related tables data will be automatically removed from dependent tables.

Long term solution #1

Query fixed the problem for some time. But people are still visiting your store and some of them will surely enough abandon the website. The problem persists. So a better approach would be to regularly cleanup those tables via cron. A plugin comes to the rescue. Meet Aoe Quote Cleaner.

Long term solution #2

A better approach as it does not require use of yet another plugin, is leverage MySQL event schedule.

First you have to enable it in my.cnf:

[mysqld]
event_scheduler=ON

Do not restart MySQL to apply this change as you can apply it dynamically via:

SET GLOBAL event_scheduler = ON;

Then create an actual schedule to run daily cleanup of records which are older than 60 days:

CREATE EVENT IF NOT EXISTS clean_sales_flat_quote 
ON SCHEDULE EVERY 1 DAY
DO
  DELETE FROM sales_flat_quote 
  WHERE updated_at < DATE_SUB(Now(),INTERVAL 60 DAY);

Huge core_session (core_session.ibd)

This table is used when you store Magento sessions in MySQL backend. This is the worst place you can store them.
First, it’s only needed for multi-server setup. And second, for multi-server setup you better off using Redis.

To clean up the huge core_session table after switching to a different backend, you can:

n98-magerun db:query "TRUNCATE TABLE core_session"

Huge cron_schedule table

This table rarely gets large, but it happens that after disabling of a plugin, its cron tasks are still being scheduled and never cleaned up. Remove the plugin completely and cleanup all but recent scheduled cron tasks:

DELETE FROM cron_schedule WHERE scheduled_at < NOW() - INTERVAL 2 DAY;

Huge report_viewed_product_index table

Do you really need that data?

This table stores data about viewed products: who viewed which product. This is useful for showing blocks like “Recently viewed products”. Another use case is gathering stats about most viewed products. If you don’t have these use cases and not planning to implement them, it is safe to truncate the table daily.

In that way, you will not waste performance for useless data.

While you could have set a daily cron running a TRUNCATE query against the table in question, there’s a better and more elegant approach.

You can make use of Blackhole store engine. It acts like /dev/null for database writes. So all the unnecessary I/O and disk space will be saved:

ALTER TABLE report_viewed_product_index ENGINE=blackhole;

You can also disable log events (untested).

But if you need that data…

However, if you do need that data, there are some more interesting insights about it. Knowing these will allow you to weed out the large chunks of unneeded records:

Typically the report_viewed_product_index records would have:

  • visitor_id = NULL when it is a customer, and customer_id is not NULL.
  • customer_id = NULL when it is not a logged in customer, and visitor_id is not NULL.

The case where both are NULL is not an easy one.

The vistor_id is retrieved from the Mage_Log_Model_Visitor class.

The only possibility is that it is a visitor that has an excluded user agent, which means that is a bot (Google Bot, etc.).

As you can see from the constructor, the request logging is skipped in that case:

protected function _construct()
{
    $this->_init('log/visitor');
    $userAgent = $this->_httpHelper->getHttpUserAgent();
    $ignoreAgents = $this->_config->getNode('global/ignore_user_agents');
    if ($ignoreAgents) {
        $ignoreAgents = $ignoreAgents->asArray();
        if (in_array($userAgent, $ignoreAgents)) {
            $this->_skipRequestLogging = true;
        }
    }
    if ($this->_logCondition->isLogDisabled()) {
        $this->_skipRequestLogging = true;
    }
}

The ignored user agents are defined under app/code/core/Mage/Log/etc/config.xml and are the following:

  • Googlebot/1.0 (googlebot@googlebot.com http://googlebot.com/)
  • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • Googlebot/2.1 (+http://www.googlebot.com/bot.html)

Finally, the case where both are not NULL

It happens when:

  • logged out customer visits a product page
  • leaves the website
  • comes back and logs in as a customer
  • revisits the same product page

In that case, Magento will keep both values.

The visitor data is later set to NULL when the customer logs out with the following code:

public function purgeVisitorByCustomer(Mage_Reports_Model_Product_Index_Abstract $object)
{
    /**
     * Do nothing if customer not logged in
     */
    if (!$object->getCustomerId()) {
        return $this;
    }

    $bind   = array('visitor_id'      => null);
    $where  = array('customer_id = ?' => (int)$object->getCustomerId());
    $this->_getWriteAdapter()->update($this->getMainTable(), $bind, $where);

    return $this;
}

What about the daily cron that does log cleaning? Inside app/code/core/Mage/Reports/Model/Resource/Product/Index/Abstract.php you can locate the clean function:

/**
 * Clean index (visitor)
 *
 * @return Mage_Reports_Model_Resource_Product_Index_Abstract
 */
public function clean()
{
    while (true) {
        $select = $this->_getReadAdapter()->select()
            ->from(array('main_table' => $this->getMainTable()), array($this->getIdFieldName()))
            ->joinLeft(
                array('visitor_table' => $this->getTable('log/visitor')),
                'main_table.visitor_id = visitor_table.visitor_id',
                array())
            ->where('main_table.visitor_id > ?', 0)
            ->where('visitor_table.visitor_id IS NULL')
            ->limit(100);
        $indexIds = $this->_getReadAdapter()->fetchCol($select);

        if (!$indexIds) {
            break;
        }

        $this->_getWriteAdapter()->delete(
            $this->getMainTable(),
            $this->_getWriteAdapter()->quoteInto($this->getIdFieldName() . ' IN(?)', $indexIds)
        );
    }
    return $this;
}

It:

  • Cleans only those records in report_viewed_product_index where visitor_id isn’t null and the corresponding entry in log_visitor no longer exists
  • Does not clean any records where visitor_id is NULL (presumably this data is useful because it has to contain positive customer_id). This gives you historical data about what your customers have been looking at before or after their purchase.
  • It fails to clean the records where both visitor_id and customer_id are NULL. This is what we will take care about, because this is truly a bug that Magento won’t clean this.

Even if you had setup Magento 1.9 cron properly and enabled log cleaning in Magento settings, the table report_viewed_product_index is not cleaned properly. Let’s take care of this:

@daily /usr/local/bin/n98-magerun db:query "DELETE FROM report_viewed_product_index  WHERE visitor_id IS NULL and customer_id IS NULL" > /dev/null

Finally, if you want to go hardcore and trim and truncate all you want, read this article about all the tables you can truncate without affecting your store function.

To get information about the sizes of log tables in Magento, use:

php -f shell/log.php status

Magento is by far most widely used Ecommerce self-hosted platform. But it needs very thorough approach when it comes to performance. One important factor that affects Magento performance is its database size. Be sure to check our full ultimate Magento performance checklist for all the tips and tricks on getting Magento to run as fast possible.

Or save yourself lots of time and let me configure and setup Magento server for you.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.