JMS Serializer performance issue with more than 10,000 entries

I'm currently building a PHP command that updates my ElasticSearch index.


One big thing I noticed, though, is that serializing the entities takes far too long once my array holds more than 10,000 of them. I expected it to scale linearly, but 6k or 9k entities both take about a minute (there's little difference between 6k and 9k), whereas past 10k it slows down to the point of taking up to 10 minutes.
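A quick way to confirm where the time actually goes is to time the serialize calls in fixed-size chunks and log each chunk's duration. This is only a minimal sketch: the chunk size is arbitrary, and `$entity_array`, `$groups` and `$this->serializer` refer to the names used in the snippet below.

    // assumes JMS\Serializer\SerializationContext is imported, as in the snippet below;
    // logs how long each chunk of 1000 serializations takes, to check whether
    // the per-entity cost really jumps once the total passes 10k
    $start = microtime(true);
    foreach (array_chunk($entity_array, 1000) as $i => $chunk) {
        foreach ($chunk as $entity) {
            $this->serializer->serialize($entity, 'json', SerializationContext::create()->setGroups($groups));
        }
        printf("chunk %d: %.2fs\n", $i, microtime(true) - $start);
        $start = microtime(true);
    }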


...

    // we iterate over the entities previously fetched from the SQL database
    foreach ($entities as $index_name => $entity_array) {
        $underscoreClassName = $this->toUnderscore($index_name); // elasticsearch expects underscored names
        $camelcaseClassName = $this->toCamelCase($index_name);   // sql expects camelcase names

        // we get the serialization groups for each index from the config file
        $groups = $indexesInfos[$underscoreClassName]['types'][$underscoreClassName]['serializer']['groups'];

        foreach ($entity_array as $entity) {
            // each entity is serialized to a json string
            $data = $this->serializer->serialize($entity, 'json', SerializationContext::create()->setGroups($groups));

            // each serialized entity is wrapped in an Elastica document
            $documents[$index_name][] = new \Elastica\Document($entityToFind[$index_name][$entity->getId()], $data);
        }
    }

...

There's a whole class around this, but this is where most of the time is spent.


I can understand that serialization is a heavy operation and takes time, but why is there almost no difference between 6, 7, 8 or 9k, yet once there are more than 10k entities it takes so much longer?


PS: for reference, I opened an issue on GitHub.


EDIT:


To explain more precisely what I'm trying to do: we have a SQL database in a Symfony project, with Doctrine linking the two, and we use ElasticSearch (plus the FOSElastica and Elastica bundles) to index our data into ElasticSearch.


The problem is that while FOSElastica takes care of updating data that was changed in the SQL database, it doesn't update every index that contains this data. (For example, if you have an author and two books he wrote, in ES you'd have the two books with the author embedded, plus the author himself. FOSElastica only updates the author, not the author information embedded in the two books.)


So, to work around this, I'm writing a script that listens to every update done through Doctrine, collects every ElasticSearch document related to that update, and updates them all. This works, but it's far too slow in my stress tests, which update more than 10,000 large documents.
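For context, here's a minimal sketch of the kind of listener this involves, assuming a classic Doctrine `postUpdate` lifecycle listener; the class name and the way the collected ids are consumed afterwards are hypothetical.

    use Doctrine\ORM\Event\LifecycleEventArgs;

    // Hypothetical sketch: record which entities were updated through Doctrine,
    // so every ElasticSearch document embedding them can be re-indexed later.
    class ElasticaSyncListener
    {
        /** @var array<string, int[]> entity class => ids of documents to refresh */
        private $pendingIds = [];

        public function postUpdate(LifecycleEventArgs $args)
        {
            $entity = $args->getEntity();
            $this->pendingIds[get_class($entity)][] = $entity->getId();
        }
    }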


EDIT:


To add more information about what I've tried: I ran into the same problem when using FOSElastica's "populate" command. At 9k everything is fine and smooth; at 10k it really takes a very long time.


I'm currently running tests where I reduce the size of the array in my script and reset it; no luck so far.


海绵宝宝撒
2 Answers

守着星空守着你

I changed the way my algorithm works: I first fetch all the ids that need updating, then retrieve the entities from the database in batches of 500-1000 (I'm still running tests).

    /*
     * to avoid creating arrays with too many objects, we loop over the ids
     * and split them by DEFAULT_BATCH_SIZE; this way we fetch them in packs
     * of DEFAULT_BATCH_SIZE and push them by the same amount
     */
    for ($i = 0; $i < sizeof($idsToRequest); $i++) {
        $currentSetOfIds[] = $idsToRequest[$i];

        // every time we have DEFAULT_BATCH_SIZE ids, or at the end of the loop, we update the documents
        if (($i + 1) % self::DEFAULT_BATCH_SIZE == 0 || $i == sizeof($idsToRequest) - 1) {
            if ($currentSetOfIds) {
                // retrieve a batch of entities from the database
                $entities = $thatRepo->findBy(array('id' => $currentSetOfIds));

                // serialize and create documents from the entities we just fetched
                foreach ($entities as $entity) {
                    $data = $this->serializer->serialize($entity, 'json', SerializationContext::create()->setGroups($groups));
                    $documents[] = new \Elastica\Document($entityToFind[$indexName][$entity->getId()], $data);
                }

                // update all the serialized documents
                $elasticaType->updateDocuments($documents);

                // reset the arrays
                $currentSetOfIds = [];
                $documents = [];
            }
        }
    }

I'm updating the documents in those same batch amounts, but it still doesn't improve the performance of the serialize method. I really don't understand what difference it makes to the serializer whether I have 9k or 10k entities, when it never sees that number...
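One thing that may be worth trying on top of this batching (not mentioned in the answer): Doctrine's EntityManager keeps every entity it hydrates in its UnitOfWork, so memory grows batch after batch unless the manager is cleared. A hedged sketch of the flush step, assuming `$this->em` holds the EntityManager:

    // after each batch is pushed, detach the managed entities so the
    // UnitOfWork (and memory usage) does not grow with every batch
    $elasticaType->updateDocuments($documents);
    $this->em->clear();     // assumption: $this->em is the Doctrine EntityManager
    gc_collect_cycles();    // optionally force PHP to reclaim the freed memory
    $currentSetOfIds = [];
    $documents = [];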

阿波罗的战车

In my opinion, you should check memory consumption: you are building a big array holding a lot of objects. You have two solutions: use a generator to avoid building that array, or try pushing your documents every "x" iterations and resetting your arrays.

I hope this gives you an idea of how to handle this kind of migration.

By the way, I almost forgot to tell you to avoid using ORM/ODM repositories to retrieve the data (in migration scripts). The problem is that they build objects and hydrate them, and honestly, in huge migration scripts you will just wait forever. If possible, just use the Database object directly; that may well be enough for your needs.
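A minimal sketch of both suggestions, assuming Doctrine ORM 2.x where `Query::iterate()` is the streaming API and DBAL provides the raw connection; the entity class, table and column names are illustrative:

    // 1) stream the entities instead of loading them into one big array,
    //    clearing the EntityManager regularly so hydrated objects are released
    $query = $em->createQuery('SELECT b FROM App\Entity\Book b');
    foreach ($query->iterate() as $i => $row) {
        $entity = $row[0];
        // ... serialize / index $entity here ...
        if ($i % 500 === 0) {
            $em->clear();
        }
    }

    // 2) or skip hydration entirely and read raw rows through the DBAL connection
    $rows = $em->getConnection()
        ->executeQuery('SELECT id, title, author_id FROM book')
        ->fetchAllAssociative();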