I was writing a Hadoop job that processes many files and creates multiple output files from each input file. I was using "MultipleOutputs" to write them. It worked fine for a small number of files, but with a large number of files I got the following error. Increasing the ulimit and -Xmx did not help.
2013-01-15 13:44:05,154 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.hdfs.DFSOutputStream$Packet.<init>(DFSOutputStream.java:201)
at org.apache.hadoop.hdfs.DFSOutputStream.writeChunk(DFSOutputStream.java:1423)
at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:161)
at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:136)
at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:125)
at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:116)
at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:90)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:54)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:78)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:99)
**at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:386)
at com.demoapp.collector.MPReducer.reduce(MPReducer.java:298)
at com.demoapp.collector.MPReducer.reduce(MPReducer.java:28)**
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:595)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:433)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Solution:
I resolved it by passing the following configuration values to the job:
OPTS="-Dmapred.reduce.tasks=8 -Dio.sort.mb=640 -Dmapred.task.timeout=1200000"
hadoop jar ${JAR} ${OPTS} -src ${SRC} -dest ${DST}
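One plausible reading of why these values help, given that the trace fails inside MultipleOutputs.write: each reducer keeps a writer open for every output file it has written to, so spreading the files over more reducers leaves each task with fewer open streams to buffer. The sketch below is only back-of-the-envelope arithmetic, not anything taken from Hadoop itself. The class name, the helper methods, and the figure of 4000 output files are hypothetical, and it assumes the output files are spread evenly across reducers.

```java
public class ReducerLoadEstimate {

    // Rough number of output files one reducer handles, assuming the
    // files are distributed evenly across reducers (ceiling division).
    static long outputsPerReducer(long totalOutputs, int reducers) {
        return (totalOutputs + reducers - 1) / reducers;
    }

    // Converts a timeout given in milliseconds (as mapred.task.timeout is)
    // to whole minutes.
    static long timeoutMinutes(long timeoutMillis) {
        return timeoutMillis / 60_000L;
    }

    public static void main(String[] args) {
        long totalOutputs = 4000;      // hypothetical number of output files
        int reducers = 8;              // value of -Dmapred.reduce.tasks above
        long timeoutMillis = 1_200_000L; // value of -Dmapred.task.timeout above

        System.out.println("files per reducer with 1 reducer: "
                + outputsPerReducer(totalOutputs, 1));
        System.out.println("files per reducer with " + reducers + " reducers: "
                + outputsPerReducer(totalOutputs, reducers));
        System.out.println("task timeout in minutes: "
                + timeoutMinutes(timeoutMillis));
    }
}
```

With these illustrative numbers, a single reducer would handle all 4000 files, while 8 reducers handle about 500 each, and the task timeout works out to 20 minutes. The real distribution depends on how keys are partitioned, so treat this as an estimate only.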