Calculating the similarity of posts

I am using PHP 7.3 and I am calculating the similarity of posts.


<?php

$posts = [
    'post_count' => 3,
    'posts' => [
        [
            'ID' => 1,
            'post_content' => "Wrong do point avoid by fruit learn or in death. So passage however besides invited comfort elderly be me. Walls began of child civil am heard hoped my. Satisfied pretended mr on do determine by.",
        ],
        [
            'ID' => 2,
            'post_content' => "Lorem ipsum dolor sit"
        ],
        [
            'ID' => 3,
            'post_content' => "Months on ye at by esteem desire warmth former. Sure that that way gave any fond now. His boy middleton sir nor engrossed affection excellent."
        ],
        [
            'ID' => 4,
            'post_content' => "Lorem ipsum dolor sit"
        ],
    ]
];

print_r($posts);

function getNonSimilarTexts($posts)
{
    $similarityPercentageArr = array();

    for ($i = 0; $i <= $posts['post_count']; $i++) {
        // $posts->the_post();
        $currentPost = $posts['posts'][$i];
        if (!is_null($currentPost['ID'])) {
            for ($y = 0; $y <= $posts['post_count']; $y++) {
                $comparePost = $posts['posts'][$y];
                if (!is_null($comparePost['ID'])) {
                    similar_text(strip_tags($currentPost['post_content']), strip_tags($comparePost['post_content']), $perc);
                    // similarity is 100 if self compare
                    if ($perc != 100) {
                        array_push($similarityPercentageArr, [$currentPost['ID'], $comparePost['ID'], $perc]);
                    }
                }
            }
        }
    }
    return $similarityPercentageArr;
}

$p = getNonSimilarTexts($posts);
print_r($p);

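As you can see, I get an array as output in the form [[ID, ID, similarity_percentage], ...].

I want to filter this array and drop every entry with a similarity above 20%. In addition, I want to keep only one of each group of similar posts and delete the others. The result I want is the post IDs 1, 2, 3.

Any suggestions on how to filter an array like this?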


慕少森
2 answers

慕森卡

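You can filter right away by changing the condition if ($perc != 100) to if ($perc > 20), so that you only record the similar posts you want to remove. You can then skip storing the similarity value altogether, because what you end up with is already a list of post IDs to remove. So, when you have code like this:

    if ($perc > 20) {
        $similarityPercentageArr[$currentPost['ID']][] = $comparePost['ID'];
    }

you can then remove all the unwanted posts like this:

    $postsToRemove = [];
    $postsToKeep = [];
    foreach ($similarityPercentageArr as $postId => $similarPostIds) {
        // this post has already appeared as similar somewhere, so its similar posts have already been added
        if (in_array($postId, $postsToRemove)) {
            continue;
        }
        $postsToKeep[] = $postId;
        $postsToRemove = array_merge($postsToRemove, $similarPostIds);
    }

Now you have the original post IDs in $postsToKeep, and the IDs of the posts similar to them in $postsToRemove.

I would also optimize the code a bit so that similar_text is not called at all when you know you are comparing a post with itself. So, instead of if (!is_null($comparePost['ID'])) you would have if (!is_null($comparePost['ID']) && $comparePost['ID'] !== $currentPost['ID']). With the new condition this check matters for correctness as well: a self-comparison yields 100, which is also greater than 20, so without it every post would be listed as similar to itself.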
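To make the idea above concrete, here is a minimal sketch of a single-pass variation. It is not the exact code from the answer: instead of building the map of similar IDs first, it compares each post only against the posts already kept, which also means that a post with no similar posts at all still ends up in the keep list. The function name splitBySimilarity, the $threshold parameter and the sample data are assumptions made for illustration.

```php
<?php
// Hypothetical helper: the name and the $threshold parameter are illustrative,
// not part of the original question or answer.
function splitBySimilarity(array $posts, float $threshold = 20.0): array
{
    $keep = [];    // posts kept so far (full post arrays)
    $remove = [];  // IDs of posts judged too similar to a kept post

    foreach ($posts as $post) {
        $text = strip_tags($post['post_content']);
        $isDuplicate = false;

        // Compare only against posts that were already kept.
        foreach ($keep as $kept) {
            similar_text(strip_tags($kept['post_content']), $text, $perc);
            if ($perc > $threshold) {
                $isDuplicate = true;
                break;
            }
        }

        if ($isDuplicate) {
            $remove[] = $post['ID'];
        } else {
            $keep[] = $post;
        }
    }

    return [
        'keep' => array_column($keep, 'ID'),
        'remove' => $remove,
    ];
}

// Made-up sample data: posts 1 and 3 are identical, post 2 shares no characters with them.
$sample = [
    ['ID' => 1, 'post_content' => 'Lorem ipsum dolor sit'],
    ['ID' => 2, 'post_content' => 'xyz'],
    ['ID' => 3, 'post_content' => 'Lorem ipsum dolor sit'],
];

$result = splitBySimilarity($sample);
print_r($result); // keep: 1, 2; remove: 3
```

Note that this takes a plain list of posts (the 'posts' element of the array in the question), so it does not depend on post_count. Because the first post of each similar group wins, the order of the input decides which one is kept.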

大话西游666

similar_text: Calculate the similarity between two strings
levenshtein: Calculate Levenshtein distance between two strings
soundex: Calculate the soundex key of a string

As for your question: after reading it, the title does not seem to quite match what you are asking. Wouldn't it be enough to just add another condition?

<?php

$posts = [
    'post_count' => 3,
    'posts' => [
        [
            'ID' => 1,
            'post_content' => "Wrong do point avoid by fruit learn or in death. So passage however besides invited comfort elderly be me. Walls began of child civil am heard hoped my. Satisfied pretended mr on do determine by.",
        ],
        [
            'ID' => 2,
            'post_content' => "Lorem ipsum dolor sit"
        ],
        [
            'ID' => 3,
            'post_content' => "Months on ye at by esteem desire warmth former. Sure that that way gave any fond now. His boy middleton sir nor engrossed affection excellent."
        ],
        [
            'ID' => 4,
            'post_content' => "Lorem ipsum dolor sit"
        ],
    ]
];

print_r($posts);

function getNonSimilarTexts($posts)
{
    $similarityPercentageArr = array();

    for ($i = 0; $i <= $posts['post_count']; $i++) {
        // $posts->the_post();
        $currentPost = $posts['posts'][$i];
        if (!is_null($currentPost['ID'])) {
            for ($y = 0; $y <= $posts['post_count']; $y++) {
                $comparePost = $posts['posts'][$y];
                if (!is_null($comparePost['ID'])) {
                    similar_text(strip_tags($currentPost['post_content']), strip_tags($comparePost['post_content']), $perc);
                    // skip self-comparisons (similarity 100) and keep only similarities above 20
                    if ($perc != 100 && $perc > 20) {
                        array_push($similarityPercentageArr, [$currentPost['ID'], $comparePost['ID'], $perc]);
                    }
                }
            }
        }
    }
    return $similarityPercentageArr;
}

$p = getNonSimilarTexts($posts);
print_r($p);

Output:

Array
(
    [0] => Array
        (
            [0] => 1
            [1] => 3
            [2] => 23.145400593472
        )

)
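For readers who have not used the three functions listed at the top of this answer, the short sketch below calls each of them once. The input strings are made up for illustration and have nothing to do with the posts in the question.

```php
<?php
// similar_text(): returns the number of matching characters and
// writes a similarity percentage into the third (by-reference) argument.
$matching = similar_text('World', 'Word', $percent);

// levenshtein(): minimum number of single-character insertions,
// deletions or replacements needed to turn one string into the other.
$distance = levenshtein('kitten', 'sitting');

// soundex(): phonetic key; words that sound alike tend to share a key.
$keyA = soundex('Robert');
$keyB = soundex('Rupert');

printf("similar_text: %d matching chars, %.2f%%\n", $matching, $percent);
printf("levenshtein: %d\n", $distance);
printf("soundex: %s vs %s\n", $keyA, $keyB);
// prints:
// similar_text: 4 matching chars, 88.89%
// levenshtein: 3
// soundex: R163 vs R163
```

Only similar_text produces a percentage directly, which is probably why it fits a threshold such as 20% most naturally; levenshtein returns an edit count and soundex a key that can only be compared for equality.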