catspeake
I found a solution. I'm posting the code in case it helps someone.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.correlation_tools import cov_nearest
from scipy.linalg import cholesky

# These are the data, generated randomly with a given shape
rnd = np.random.random(size=(10**7, 8))

# Each attr array represents a column of the covariance matrix (I want
# correlated data, so I randomly choose numbers between 0.8 and 0.95)
attr1 = np.random.uniform(0.8, 0.95, size=(8, 1))

# attr2 through attr8 are built like attr1, with varying ranges (all above 0.7)
attr2, attr3, attr4, attr5, attr6, attr7, attr8 = (
    np.random.uniform(0.7, 0.95, size=(8, 1)) for _ in range(7)
)

# corr_mat is the matrix, the union of the columns
corr_mat = np.column_stack((attr1, attr2, attr3, attr4,
                            attr5, attr6, attr7, attr8))

# cov_nearest finds the nearest covariance matrix to corr_mat,
# to be sure that it is positive definite
a = cov_nearest(corr_mat)

upper_chol = cholesky(a)

# Finally, compute the inner product of rnd and upper_chol
ans = rnd @ upper_chol
# ans now holds randomly correlated data (high correlation, but customizable)

# Next, create a pandas DataFrame from the values in ans
df = pd.DataFrame(ans, columns=['att1', 'att2', 'att3', 'att4',
                                'att5', 'att6', 'att7', 'att8'])

# Last step: truncate the float values of ans column by column, so I get
# duplicates in varying percentages. The floats have 16 decimals, so for each
# column I randomly choose an int between 5 and 12 and round to that many
# decimal places.
vals = df.values
for i in range(8):
    trunc = np.random.randint(5, 12)
    print(trunc)
    vals.T[i] = vals.T[i].round(decimals=trunc)
```

Finally, these are the duplicate percentages for each of my columns:

duplicate rate attribute: att1 = 5.159390000000002
duplicate rate attribute: att2 = 11.852260000000001
duplicate rate attribute: att3 = 12.036079999999998
duplicate rate attribute: att4 = 35.10611
duplicate rate attribute: att5 = 4.6471599999999995
duplicate rate attribute: att6 = 35.46553
duplicate rate attribute: att7 = 0.49115000000000464
duplicate rate attribute: att8 = 37.33252
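The claim that the Cholesky trick yields highly correlated columns can be checked with `np.corrcoef`. A minimal sketch on a smaller array (sizes, ranges, and the explicit symmetrization/`threshold` are my choices for illustration, not part of the original answer):

```python
import numpy as np
from scipy.linalg import cholesky
from statsmodels.stats.correlation_tools import cov_nearest

rng = np.random.default_rng(1)
rnd = rng.random((100_000, 3))

# A rough stand-in for the answer's corr_mat: entries in [0.8, 0.95),
# symmetrized and given a unit diagonal so it resembles a correlation matrix
m = rng.uniform(0.8, 0.95, size=(3, 3))
m = (m + m.T) / 2
np.fill_diagonal(m, 1.0)

# Nudge to the nearest positive definite matrix; a small positive threshold
# keeps the smallest eigenvalue away from zero so the Cholesky factor exists
a = cov_nearest(m, threshold=1e-6)
upper_chol = cholesky(a)  # upper-triangular U with a == U.T @ U

ans = rnd @ upper_chol

# Sample correlation matrix of the transformed columns: the off-diagonal
# entries should sit close to the values we put into m, i.e. around 0.8+
c = np.corrcoef(ans, rowvar=False)
print(c.round(2))
```

Because `rnd` has i.i.d. columns, the correlation matrix of `ans` is (up to sampling noise) exactly `a`, so the printed off-diagonals land near the 0.8–0.95 range chosen above.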
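The answer reports per-column duplicate percentages but not how they were measured. One way to get them is `Series.duplicated`, which flags every repeat of an earlier value; the mean of those flags times 100 is the duplicate rate. A minimal sketch (the small frame and column names are illustrative, not the answer's 10**7-row DataFrame):

```python
import numpy as np
import pandas as pd

# Small illustrative frame: 1000 uniforms rounded to 2 decimals, so each
# column can take at most 101 distinct values and duplicates are guaranteed
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 2)).round(2), columns=['att1', 'att2'])

# duplicated() marks each repeat of an earlier value; its mean is the
# fraction of rows that are duplicates, times 100 for a percentage
rates = {col: df[col].duplicated().mean() * 100 for col in df.columns}
for col, rate in rates.items():
    print(f'duplicate rate attribute: {col} = {rate}')
```

The fewer decimals a column is rounded to, the more collisions it has, which is why the randomized rounding in the answer produces duplicate rates that vary widely from column to column.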