How To Read Large Excel File Using Apache POI

Introduction

Apache POI is the most popular library when comes to manipulate word, excel, ppt based documents. In this example I am going to show you how to read large excel file using Apache POI library. Your large excel file may contain various kind of information, such as, Human Resource data, Sales data, Credit Cards, Bank Transactions, etc. The large excel file may have at least 50 columns and 100k (100000) rows.

When you are loading the large excel file in memory during reading then you probably would get java.lang.OutOfMemoryError: Java heap space. One way to fix the issue by increasing the heap size in JVM by setting the -Xmx argument, but if you are using Java 8 then you don’t need to setup heap size (automatically gets calculated) because in Java 8 onward version, the PermGen space was replaced by Meta space.

Why are you getting OutOfMemoryError? Processing Excel files in POI using the UserModel code is DOM (Document Object Model) based, so the whole file (including parsed form) needs to be buffered into memory and that’s why you are getting OutOfMemoryError, because the large file does not fit into the available memory.

Also check if you are using File instead of FileInputStream, because using a File object allows for lower memory consumption, while an InputStream requires more memory as it has to buffer the whole file. So FileInputStream needs more memory than File.

Let’s consider the following example to read a large excel sheet:

Workbook wb = WorkbookFactory.create(new File("Sales-Records.xlsx"));

XSSFSheet sheet = (XSSFSheet) wb.getSheetAt(0);

for (Row r : sheet) {
	for (Cell c : r) {
		System.out.println(c.getStringCellValue());
	}
}

The excel file I am reading using the above code snippets has a size of 120MB. Though I am using XSSF event model, still I got the following error while I am running the code:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Next I will try to use XSSF and SAX event API to resolve the issue “out of memory” while loading the large excel file in memory.

Note the version of Java should be at least 8 and the version of Apache POI has been used is 5.0.0.

read large excel file using apache poi

XSSF and SAX (Event API)

Using the common SS (SpreadSheet) interfaces, you can code for both HSSF (.xls) HSSFWorkbook and XSSF (.xlsx) XSSFWorkbook basically transparently. However, it needs lots of memory.

The SAX (Simple API for XML) based event API only processes part of the file at a time, so it needs much less memory while processing XML document. However, it requires more work from your side, which is why you might be better to allocate a few more GB of memory at the problem.

If memory footprint is an issue, then for XSSF, you can get at the underlying XML data, and process it yourself. Its relatively simple to use, but requires a basic understanding of the file structure. The advantage provided is that you can read a XLSX file with a relatively small memory footprint.

One important thing to note with the basic Event API is that it triggers events only for things actually stored within the file. With the XLSX file format, it is quite common for things that have yet to be edited to simply not exist in the file. This means there may well be apparent “gaps” in the record stream, which you need to work around.

To use this API you construct an instance of org.apache.poi.xssf.eventmodel.XSSFReader. This will optionally provide a nice interface on the shared strings table, and the styles. It provides methods to get the raw xml data from the rest of the file, which you will then pass to SAX.

The following code shows how to read the large excel file:

public class SaxEventUserModel {

	public void processSheets(String filename) throws Exception {
		OPCPackage pkg = OPCPackage.open(filename);
		XSSFReader r = new XSSFReader(pkg);
		SharedStringsTable sst = r.getSharedStringsTable();
		XMLReader parser = fetchSheetParser(sst);
		Iterator<InputStream> sheets = r.getSheetsData();
		while (sheets.hasNext()) {
			System.out.println("Processing new sheet:\n");
			InputStream sheet = sheets.next();
			InputSource sheetSource = new InputSource(sheet);
			parser.parse(sheetSource);
			sheet.close();
			System.out.println("");
		}
	}

	public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException, ParserConfigurationException {
		XMLReader parser = XMLHelper.newXMLReader();
		ContentHandler handler = new SheetHandler(sst);
		parser.setContentHandler(handler);
		return parser;
	}

	/**
	 * See org.xml.sax.helpers.DefaultHandler javadocs
	 */
	private static class SheetHandler extends DefaultHandler {
		private SharedStringsTable sst;
		private String lastContents;
		private boolean nextIsString;

		private SheetHandler(SharedStringsTable sst) {
			this.sst = sst;
		}

		@Override
		public void startElement(String uri, String localName, String name, Attributes attributes) throws SAXException {
			// c => cell
			if (name.equals("c")) {
				// Print the cell reference
				System.out.print(attributes.getValue("r") + " - ");
				// Figure out if the value is an index in the SST
				String cellType = attributes.getValue("t");
				if (cellType != null && cellType.equals("s")) {
					nextIsString = true;
				} else {
					nextIsString = false;
				}
			}
			// Clear contents cache
			lastContents = "";
		}

		@Override
		public void endElement(String uri, String localName, String name) throws SAXException {
			// Process the last contents as required.
			// Do now, as characters() may be called more than once
			if (nextIsString) {
				int idx = Integer.parseInt(lastContents);
				lastContents = sst.getItemAt(idx).getString();
				nextIsString = false;
			}
			// v => contents of a cell
			// Output after we've seen the string contents
			if (name.equals("v")) {
				System.out.println(lastContents);
			}
		}

		@Override
		public void characters(char[] ch, int start, int length) {
			lastContents += new String(ch, start, length);
		}
	}

}

The whole source code with sample large excel file can be downloaded from the section Source Code later in this tutorial.

Maven Dependency

The maven version 3.8.1 and Apache POI 5.0.0 have been used for this example:

<dependencies>
	<dependency>
		<groupId>org.apache.poi</groupId>
		<artifactId>poi-ooxml</artifactId>
		<version>5.0.0</version>
		<exclusions>
			<exclusion>
				<groupId>xml-apis</groupId>
				<artifactId>xml-apis</artifactId>
			</exclusion>
		</exclusions>
	</dependency>
</dependencies>

You need to exclude xml-apis from the Apache POI otherwise you will get the following error:

The package javax.xml.parsers is accessible from more than one module: <unnamed>, java.xml

Source Code

Download

Leave a Reply

Your email address will not be published. Required fields are marked *