Magento Images to WooCommerce

Migrating Product Images from Magento to WooCommerce

A Python Script to Web Scrape Magento Images into Product-Named Folders

There are various plugins available to migrate products from Magento across to WooCommerce. They work with varying degrees of success and usually offer both free and paid versions. The free versions, at least as far as I have found, do not include the ability to migrate the product images across, and the paid versions might seem a bit much just to move your images for a one-off migration if you are not a large company.

At first it might seem like a simple case of copying the images from the Magento media directory over to the WordPress uploads directory and working out some way of reassigning them to the products, but the way Magento stores images complicates things a little.

When you upload a product image in Magento, it is stored in a cache directory structure that is not very user friendly and typically looks something like this:

/media/catalog/product/cache/1/image/350x/9df78eab33525d08d6e5fb8d27136e95/w/o/imagename.jpg

This presents a problem: if you download the entire image cache directory it is very difficult and time consuming to track down which image is assigned to which product, and the resulting directory structure is not something you would really want to upload to the WordPress installation anyway. It is far better to have all the images in an easier, human-readable structure.

With this in mind I set out to write a web scraper in Python to get all the images from the Magento site and store them in a nicely structured and named format. I wanted to achieve the following:

  • Obtain a list of the products and product URLs in the sitemap – ensuring no duplicates
  • Download every product image and save it in a folder named after the product URL minus the domain name
  • Save each image as product-name-url-X.jpg – where X is an incremented number starting from 1

I used Beautiful Soup 4 to facilitate the web scraping, requests to fetch the pages and urllib for the actual downloading.
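
For reference, the snippets below assume the following imports at the top of the script (Beautiful Soup is installed as the beautifulsoup4 package):

import os
import time  # only needed if you uncomment the sleep between downloads later on
import urllib.request

import requests
from bs4 import BeautifulSoup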

First I wrote a function to scrape the sitemap page, retrieve all the product URLs, add each one to a list if it is not already stored (to avoid duplication) and then return the list.

My sitemap page has the URLs stored in an unordered list, with each <li> tag having class="product" assigned, so I scraped the href of the child <a> tag contained within each of those list items:

def get_product_url_list():
    productlist = []
    # Set URL for sitemap to parse
    sitemap = 'https://www.MYDOMAIN.com/sitemap'
    # Connect to the SITEMAP
    response = requests.get(sitemap)
    # Parse HTML and save to BeautifulSoup object
    soup = BeautifulSoup(response.text, "html.parser")
    # Loop through the soup object finding <li> tags with class 'product'
    for link in soup.find_all('li', class_="product"):
        producturl = link.a.get('href')
        if producturl not in productlist:
            productlist.append(producturl)
    return productlist
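
Just to illustrate the shape of the data, the function returns a plain list of full product URLs, something like this (the product slugs here are made up):

>>> get_product_url_list()
['https://www.MYDOMAIN.com/my-product-name',
 'https://www.MYDOMAIN.com/another-product-name']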

Next I wanted to create the folders on my local machine to store the images. I decided to name each folder after the product URL minus the domain.
My site is set to use short product URLs and not include the category in the URL structure, as this caused some SEO issues in the past.
The code could be improved by calculating the character length of the domain section (a sketch of that follows the function below), but for brevity I simply entered the length 27 manually.
Product names are returned in a productname list and are of the format 'my-product-name' (my site URLs have had the .php stripped already):

def get_product_name():  # removes the https://www.MYDOMAIN.com/ section
    productname = []
    for product in get_product_url_list():
        productname.append(product[27:]) 
    return productname
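
As mentioned above, the hard-coded 27 could be avoided by stripping the domain programmatically instead. Here is a minimal sketch of an alternative version using urlparse from the standard library, with the same placeholder domain as elsewhere in this post:

from urllib.parse import urlparse

def get_product_name():  # alternative version: derive the slug from the URL path
    productname = []
    for product in get_product_url_list():
        # urlparse('https://www.MYDOMAIN.com/my-product-name').path gives '/my-product-name'
        productname.append(urlparse(product).path.strip('/'))
    return productname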

I then created a function that creates a directory for a given product name, if it does not already exist. The names returned by get_product_name() are used as the directory names:

def create_product_dirs(productname):
    # Build the target path under ./images and create it if it does not already exist
    directory = f"./images/{productname}"
    if not os.path.exists(directory):
        os.makedirs(directory)

The function for downloading the images then takes the product name as an argument, which is used to set the URL to scrape from, the directory to save to and the image file name to save as:

def get_product_images(urlproductname):
    # Set the URL you want to webscrape from
    url = f"https://www.MYDOMAIN.com/{urlproductname}"
    response = requests.get(url)  # Connect to the URL
    # Parse HTML and save to BeautifulSoup object
    soup = BeautifulSoup(response.text, "html.parser")
    count = 0
    # Loop through the soup object finding <a> links with class 'lightbox' - check what class is assigned to gallery images in your theme
    for link in soup.find_all('a', class_="lightbox"):
        imgurl = link.get('href')
        count += 1
        imgname = f"./images/{urlproductname}/{urlproductname}-{count}.jpg"
        urllib.request.urlretrieve(imgurl, imgname)
        # time.sleep(1)  # uncomment this to pause the code for a second if you have issues with your host blocking you due to the scraping

N.B. Some hosts might detect a scraping script like this as something malicious and block or throttle your IP. If you wish to be careful and avoid this, you can uncomment the time.sleep(1) line to add a pause in between downloads. It will of course take longer to download everything that way, but it is a useful way to avoid problems.
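
For completeness, here is a minimal sketch of how the functions above could be wired together; the full script in the Gist below may differ slightly:

if __name__ == "__main__":
    # For each product: create its folder, then download its gallery images into it
    for productname in get_product_name():
        create_product_dirs(productname)
        get_product_images(productname)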

The whole Python script is available here as a Gist – comments & questions welcome.

Next I will be working on a script to upload the images to WooCommerce and assign them to the correct products.
