Loading a single-line file of about 4 GB into Spark

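I am trying to load a file that consists of a single line. There are no newline characters anywhere in the file, so technically the length of that one line is the size of the file. I tried to load the data with the code below.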

val data= spark.sparkContext.textFile("location")

data.count

It never returns a value.

I then tried to read the file as a string with the code below, written against the Java APIs.

import org.apache.hadoop.conf.Configuration

import org.apache.hadoop.fs.Path

import org.apache.hadoop.fs.FileSystem

val inputPath = new Path("File")

val conf = spark.sparkContext.hadoopConfiguration

val fs = FileSystem.get(conf)

val inputStream = fs.open(inputPath)

import java.io.{BufferedReader, InputStreamReader}

val readLines = new BufferedReader(new InputStreamReader(inputStream)).readLine()

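The JVM exits with the following error.

Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fcb6ba00000, 2148532224, 0) failed; error='Cannot allocate memory' (errno=12)

There is insufficient memory for the Java Runtime Environment to continue. Native memory allocation (mmap) failed to map 2148532224 bytes for committing reserved memory.

The problem is that all of the data is on a single line, and \n is what is used to identify a new record (a new line). Because there is no \n in the file, it tries to load everything as one line, which causes the memory problem.

I am fine with splitting that long string by length and adding a newline after every 200 characters: characters (0,200) are the first line, (200,400) are the second line, and so on.

Sample input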

样本输入

This is Achyuth This is ychyath This is Mansoor ... .... this line size is more than 4 gigs.

Output

This is Achyuth

This is ychyath

This is Mansoor
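
The fixed-length split described above can be sketched without Spark as a streaming pre-processing step. This is only a minimal sketch and not part of the original question: `FixedWidth.wrapFixedWidth` is a hypothetical helper name, and it assumes that cutting on a plain character count is acceptable. It holds a single `width`-sized buffer in memory, so the whole line is never built as one string.

```scala
import java.io.{Reader, Writer}

object FixedWidth {
  // Hypothetical helper: copies `in` to `out` and writes a newline
  // after every `width` characters. Only `width` chars are buffered.
  def wrapFixedWidth(in: Reader, out: Writer, width: Int): Unit = {
    require(width > 0, "width must be positive")
    val buf = new Array[Char](width)
    var filled = 0
    var n = in.read(buf, 0, width)
    while (n != -1) {
      filled += n
      if (filled == width) {
        // A full record is available: emit it and start a new one.
        out.write(buf, 0, width)
        out.write("\n")
        filled = 0
      }
      // Keep filling the remainder of the current record.
      n = in.read(buf, filled, width - filled)
    }
    // Emit a trailing partial record, if any.
    if (filled > 0) {
      out.write(buf, 0, filled)
      out.write("\n")
    }
  }
}
```

For a file on the local filesystem, the reader and writer could come from `java.nio.file.Files.newBufferedReader` and `Files.newBufferedWriter`. The rewritten file then has one record per line, which `textFile` should be able to split in the usual way.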

森林海

2 Answers

一只斗牛犬

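This approach works if the file size is a multiple of the split size and the character encoding is fixed-length (ASCII, UTF-16, UTF-32, UTF-8 with no code points above 127, or similar...).

Given the file

This is AchyuthThis is ychyathThis is Mansoor

// Provides the Encoder[String] that createDataset needs (spark-shell imports it automatically)
import spark.implicits._

val rdd = spark
  .sparkContext
  .binaryRecords(path, 15) // fixed-length records of 15 bytes each
  .map(bytes => new String(bytes))

val df = spark.createDataset(rdd)

df.show()

Output:

+---------------+
|          value|
+---------------+
|This is Achyuth|
|This is ychyath|
|This is Mansoor|
+---------------+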
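
To make the two preconditions concrete, here is a small Spark-free sketch that mimics a fixed-width byte split locally. `FixedRecords.split` is a hypothetical stand-in written for illustration, not Spark's implementation. It simply enforces the "multiple of the record length" condition with `require`, and it decodes as UTF-8 to show what happens when a multi-byte character straddles a record boundary.

```scala
import java.nio.charset.StandardCharsets

object FixedRecords {
  // Hypothetical local stand-in for a fixed-width byte split:
  // cuts `bytes` into records of `recordLength` bytes and decodes each one.
  def split(bytes: Array[Byte], recordLength: Int): List[String] = {
    require(recordLength > 0, "recordLength must be positive")
    require(bytes.length % recordLength == 0,
      s"input length ${bytes.length} is not a multiple of $recordLength")
    bytes
      .grouped(recordLength)
      .map(b => new String(b, StandardCharsets.UTF_8))
      .toList
  }
}

object FixedRecordsDemo {
  def main(args: Array[String]): Unit = {
    // Single-byte characters: every 15-byte record is also 15 characters.
    val ascii = "This is AchyuthThis is ychyathThis is Mansoor"
      .getBytes(StandardCharsets.US_ASCII)
    FixedRecords.split(ascii, 15).foreach(println)

    // "\u00e9" takes two bytes in UTF-8, so a 2-byte record boundary cuts it
    // in half and the decoded records no longer reproduce the input.
    val utf8 = "a\u00e9b".getBytes(StandardCharsets.UTF_8)
    println(FixedRecords.split(utf8, 2).mkString == "a\u00e9b")
  }
}
```

With data like the second case, a byte-based record length is not safe, and a character-aware split or a delimiter-based read is the more robust choice.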

智慧大石

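Spark does not have an option to set the EOL delimiter for text files. For me, the best way to handle this is the approach described in "Setting textinputformat.record.delimiter in spark"; you will find plenty of options there.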
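
A minimal sketch of what using `textinputformat.record.delimiter` might look like, assuming the data contains some recurring separator to split on. `DelimitedText.read` is a hypothetical helper name, and the `"|"` delimiter in the usage note is purely illustrative; the sample input in the question does not show one. The setting is passed through a copy of the Hadoop configuration to `newAPIHadoopFile`, so it applies to this read only.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object DelimitedText {
  // Hypothetical helper: reads `path` as records separated by `delimiter`
  // instead of by newline.
  def read(sc: SparkContext, path: String, delimiter: String): RDD[String] = {
    // Copy the configuration so the setting does not affect other reads.
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("textinputformat.record.delimiter", delimiter)
    sc.newAPIHadoopFile(
        path,
        classOf[TextInputFormat],
        classOf[LongWritable],
        classOf[Text],
        conf)
      // Keys are byte offsets; only the record text is kept.
      .map { case (_, text) => text.toString }
  }
}
```

In spark-shell this could be called as `DelimitedText.read(spark.sparkContext, "location", "|").count`. If the data has no usable separator at all, this setting does not help and a fixed-length split is the remaining option.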
