Parsing extremely large XML files in php


Parsing extremely large XML files in PHP is a common problem: a file far bigger than available memory cannot simply be loaded with SimpleXML or DOM, so it has to be processed as a stream. Below is the question, followed by two answers with code samples.

I need to parse XML files of 40 GB in size, then normalize the data and insert it into a MySQL database. It is not clear how much of the file I need to store in the database, and I don't know the XML structure either.

Which parser should I use, and how would you go about doing this?

Solution :

In PHP, you can read extremely large XML files with XMLReader:

$reader = new XMLReader();
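To illustrate, a minimal streaming loop might look like the following sketch. The file path and the `<item>` record element are placeholders, not taken from the question; substitute whatever element marks one record in your file:

```php
<?php
// Stream a large XML file node by node with XMLReader.
// Only one record is held in memory at a time.
$reader = new XMLReader();
$reader->open('path/to/large.xml');

while ($reader->read()) {
    // Stop at each opening <item> element (placeholder name).
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        // Expand just this record into SimpleXML for convenient access.
        $item = new SimpleXMLElement($reader->readOuterXml());
        // ... normalize the record and INSERT it into MySQL here ...
    }
}
$reader->close();
```

The key point is that `read()` advances a cursor instead of building a full document tree, so memory use stays flat regardless of file size.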

Extremely large XML files should be stored in a compressed format on disk; this makes sense because XML compresses very well. For example, gzipped as large.xml.gz.

PHP supports that quite well with XMLReader via the compression stream wrappers:

$xmlfile = 'compress.zlib://path/to/large.xml.gz';

$reader = new XMLReader();
$reader->open($xmlfile);

XMLReader allows you to operate on the current element "only" — it is a forward-only parser. If you need to keep parser state, you have to build that yourself.

I often find it helpful to wrap the basic cursor movements into a set of iterators that know how to operate on XMLReader, e.g. iterating through elements or through child elements only. You'll find this outlined in Parse XML with PHP and XMLReader.
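As a minimal sketch of that idea, the cursor movements can be hidden behind a generator that yields one record at a time. The function name `xmlRecords` is illustrative, not part of any library:

```php
<?php
// Yield each matching element of an XML document as SimpleXML,
// hiding the XMLReader cursor handling from the caller.
function xmlRecords(string $uri, string $element): Generator
{
    $reader = new XMLReader();
    $reader->open($uri);
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $element) {
            yield new SimpleXMLElement($reader->readOuterXml());
        }
    }
    $reader->close();
}

// Usage (the gzipped path and element name are placeholders):
// foreach (xmlRecords('compress.zlib://path/to/large.xml.gz', 'item') as $item) {
//     // normalize and insert $item into MySQL
// }
```

Because generators are lazy, the `foreach` loop still processes the file one record at a time, so the wrapper costs no extra memory.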

Another solution:

It would be nice to know what you actually intend to do with the XML. The way you parse it depends very much on the processing you need to carry out, as well as the size.

If this is a one-off task, then in the past I have started by discovering the XML structure before doing anything else. My DTDGenerator was written for this purpose a long time ago and still does the job; there are other tools available now, but I don't know whether they do streamed processing, which is a prerequisite here.

You can write an application that processes the data using either a push or pull streaming parser (SAX or StAX, respectively). How easy this is depends on how much processing you have to do and how much state you have to maintain, which you haven't told us. Alternatively, you could try streamed XSLT processing, which is available in Saxon-EE.
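In PHP, the push (SAX-style) approach is available through the expat-based `xml_parser_*` functions: you register handlers, then feed the file in chunks, keeping any state yourself in variables the handlers can reach. A minimal sketch, with an illustrative file path and a trivial depth counter as the only state:

```php
<?php
// Push (SAX-style) parsing: handlers fire as tags arrive,
// and the file is fed in small chunks to keep memory flat.
$parser = xml_parser_create();
$depth  = 0; // example of state the application must maintain itself

xml_set_element_handler(
    $parser,
    function ($parser, $name, array $attrs) use (&$depth) {
        $depth++; // an opening tag was seen
    },
    function ($parser, $name) use (&$depth) {
        $depth--; // the matching closing tag was seen
    }
);

$fp = fopen('path/to/large.xml', 'rb');
while (!feof($fp)) {
    // Feed 8 KiB at a time; the third argument marks end of input.
    xml_parse($parser, fread($fp, 8192), feof($fp));
}
fclose($fp);
xml_parser_free($parser);
```

Compared with XMLReader's pull model, the push model inverts control: the parser calls you, so all context you need across callbacks has to be stored explicitly.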
