I have an Xlsx file with predefined texts in a single column. The user enters one or more words, and the output should be the texts that contain those words.
import numpy as np
import pandas as pd
import time
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from sklearn.metrics.pairwise import pairwise_distances
import pickle


def load_df(path):
    # Read the Excel sheet into a DataFrame and report its size.
    df = pd.read_excel(path)
    print(df.shape)
    return df


def splitDataFrameList(df, target_column, separator):
    # Split target_column on separator, emitting one row per piece.
    def splitListToRows(row, row_accumulator, target_column, separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)

    new_rows = []
    df.apply(splitListToRows, axis=1, args=(new_rows, target_column, separator))
    new_df = pd.DataFrame(new_rows)
    return new_df


class Autocompleter:
    def __init__(self):
        pass

    def import_json(self, json_filename):
        print("load Excel file...")
        df = load_df(json_filename)
        return df

    def process_data(self, new_df):
        # print("select representative threads...")
        # new_df = new_df[new_df.IsFromCustomer == False]
        print("split sentences on punctuation...")
        for sep in ['. ', ', ', '? ', '! ', '; ']:
            new_df = splitDataFrameList(new_df, 'UserSays', sep)
        print("UserSays Cleaning using simple regex...")
If I enter nothing at all, it gives me this output:
['How to access outlook on open network?', 'Email access outside ril network', 'Log in outlook away from office']
This is not what I want. And if only one text matches, it gives the following output:
input - sccm
['What is sccm', 'How to access outlook on open network?', 'Email access outside ril network']
I want the output to behave so that if the entered word or words do not appear anywhere in the xlsx file, nothing is returned.
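To make the desired behaviour concrete, here is a minimal sketch of it, not a fix for the class above (whose matching step is not shown). It assumes plain word overlap instead of TF-IDF scoring; `tokenize`, `find_matches` and `max_results` are hypothetical names, and the sample list just reuses the texts from the outputs above. In practice the texts would presumably come from something like `df['UserSays'].tolist()`.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_matches(query, texts, max_results=3):
    """Return up to max_results texts that share at least one word with query.

    An empty query, or a query with no word in common with any text,
    yields an empty list instead of falling back to unrelated texts.
    """
    query_tokens = tokenize(query)
    if not query_tokens:
        return []

    scored = []
    for text in texts:
        overlap = len(query_tokens & tokenize(text))
        if overlap > 0:  # keep only texts with at least one matching word
            scored.append((overlap, text))

    # Highest overlap first; the sort is stable, so ties keep file order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:max_results]]


# Illustrative data taken from the outputs shown in the question.
texts = [
    "What is sccm",
    "How to access outlook on open network?",
    "Email access outside ril network",
    "Log in outlook away from office",
]

print(find_matches("", texts))         # empty input
print(find_matches("sccm", texts))     # a word that occurs in one text
print(find_matches("printer", texts))  # a word that occurs nowhere
```

If the ranking is done with cosine similarity over TF-IDF vectors, the same idea would be to discard every candidate whose similarity to the query is zero before taking the top results, rather than always returning the first three.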