在 Apache Beam 上将 PCollection 作为侧面输入传递时发生 KeyError

我将side_inputPCollection 作为侧面输入传递给ParDo转换,但得到相同的 KeyError


import apache_beam as beam

from apache_beam.options.pipeline_options import PipelineOptions

from beam_nuggets.io import relational_db

from processors.appendcol import AppendCol

from side_inputs.config import sideinput_bq_config

from source.config import source_config



with beam.Pipeline(options=PipelineOptions()) as si:

  side_input = si | "Reading from BQ side input" >> relational_db.ReadFromDB(

    source_config=sideinput_bq_config,

    table_name='abc',

    query="SELECT * FROM abc"

  )


with beam.Pipeline(options=PipelineOptions()) as p:

  PCollection = p | "Reading records from database" >> relational_db.ReadFromDB(

    source_config=source_config,

    table_name='xyzzy',

    query="SELECT * FROM xyzzy",

 ) | beam.ParDo(

   AppendCol(), beam.pvalue.AsIter(side_input)

 )

下面是错误


Traceback (most recent call last):

  File "athena/etl.py", line 40, in <module>

    extract()

  File "athena/etl.py", line 22, in extract

    PCollection = p | "Reading records from database" >> relational_db.ReadFromDB(

  File "/Users/souvikdey/.pyenv/versions/3.8.5/envs/athena-venv/lib/python3.8/site-packages/apache_beam/pipeline.py", line 555, in __exit__

    self.result = self.run()

  File "/Users/souvikdey/.pyenv/versions/3.8.5/envs/athena-venv/lib/python3.8/site-packages/apache_beam/pipeline.py", line 534, in run

    return self.runner.run_pipeline(self, self._options)

  File "/Users/souvikdey/.pyenv/versions/3.8.5/envs/athena-venv/lib/python3.8/site-packages/apache_beam/runners/direct/direct_runner.py", line 119, in run_pipeline

    return runner.run_pipeline(pipeline, options)

  File "/Users/souvikdey/.pyenv/versions/3.8.5/envs/athena-venv/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 175, in run_pipeline

    self._latest_run_result = self.run_via_runner_api(

我正在从 PostgreSQL 表中读取数据,PCollection 的每个元素都是一个字典。


慕森卡
浏览 87回答 1
1回答

猛跑小猪

我认为问题在于你有两个独立的管道试图一起工作。您应该将所有转换作为单个管道的一部分执行:with beam.Pipeline(options=PipelineOptions()) as p:&nbsp; side_input = p | "Reading from BQ side input" >> relational_db.ReadFromDB(&nbsp; &nbsp; source_config=sideinput_bq_config,&nbsp; &nbsp; table_name='abc',&nbsp; &nbsp; query="SELECT * FROM abc")&nbsp; my_pcoll = p | "Reading records from database" >> relational_db.ReadFromDB(&nbsp; &nbsp; &nbsp; &nbsp; source_config=source_config,&nbsp; &nbsp; &nbsp; &nbsp; table_name='xyzzy',&nbsp; &nbsp; &nbsp; &nbsp; query="SELECT * FROM xyzzy",&nbsp; &nbsp; ) | beam.ParDo(&nbsp; &nbsp; &nbsp; &nbsp; AppendCol(), beam.pvalue.AsIter(side_input))
打开App,查看更多内容
随时随地看视频慕课网APP

相关分类

Python