Stata: Sequential elimination of observations based on pairwise
comparisons within a group
My Stata dataset contains observations on constituent components of
products created by different players in a simulation. I would like to
retain only the products (created by each player) that consist of distinct
and unique components. Therefore, I have tried to develop an algorithm
which would sequentially eliminate products based on pairwise value
comparisons within a group. Here is the data:
+---------------+----------+--------+-----------+----+----+----+----+---------------+
| simulation_id | prodsrt  | player | dupl_prod | n8 | n6 | n4 | n2 | product_value |
+---------------+----------+--------+-----------+----+----+----+----+---------------+
|             1 | 04091520 |      1 |         0 |  4 |  9 | 15 | 20 |         3,498 |
|             1 | 02081821 |      1 |         0 |  2 |  8 | 18 | 21 |         2,457 |
|             1 | 06101424 |      2 |         0 |  6 | 10 | 14 | 24 |         3,686 |
|             1 | 03071719 |      2 |         9 |  3 |  7 | 17 | 19 |         2,025 |
|             1 | 05111323 |      2 |         7 |  5 | 11 | 13 | 23 |         2,509 |
|             1 | 03121619 |      2 |         2 |  3 | 12 | 16 | 19 |         2,544 |
|             1 | 01111319 |      2 |         4 |  1 | 11 | 13 | 19 |         2,791 |
|             1 | 05071723 |      2 |         5 |  5 |  7 | 17 | 23 |         2,509 |
+---------------+----------+--------+-----------+----+----+----+----+---------------+
The final outcome in this case should look like:
+---------------+----------+--------+-----------+----+----+----+----+---------------+
| simulation_id | prodsrt  | player | dupl_prod | n8 | n6 | n4 | n2 | product_value |
+---------------+----------+--------+-----------+----+----+----+----+---------------+
|             1 | 04091520 |      1 |         0 |  4 |  9 | 15 | 20 |         3,498 |
|             1 | 02081821 |      1 |         0 |  2 |  8 | 18 | 21 |         2,457 |
|             1 | 06101424 |      2 |         0 |  6 | 10 | 14 | 24 |         3,686 |
|             1 | 01111319 |      2 |         4 |  1 | 11 | 13 | 19 |         2,791 |
|             1 | 05071723 |      2 |         5 |  5 |  7 | 17 | 23 |         2,509 |
+---------------+----------+--------+-----------+----+----+----+----+---------------+
The idea was to:
1) Rank all the "problematic" products (those for which dupl_prod != 0) by value for each player.
2) Pick the product with the highest product_value and the second-best one.
3) For this pair of products, check whether the two share any components:
   a. If there are overlapping components, drop the product with the inferior value.
   b. If there are no overlapping components, retain both products but exclude the product with the inferior value from the "problematic" set.
4) Re-rank the remaining products and repeat the same procedure until only non-overlapping products remain.
Since the number of "problematic" products to scan through varies by
player and by simulation, I would need to run this routine as many times
as the maximum number of pairwise comparisons required.
The current version of the code is as follows:
forval x = 1/2 {
    bysort simulation_id player (product_value): gen rank = sum(product_value != product_value[_n-1]) if dupl_prod!=0
    bysort simulation_id player (rank): egen maxrank = max(rank) if dupl_prod!=0
    gen active = cond(missing(rank),.,cond(rank==maxrank | rank==maxrank-1,1,.))
    gen dupl_cell = .
    local i = 2
    while `i' < 15 { /* max number of digits in productid is equal to 14 */
        sort simulation_id player active n`i'
        quietly by simulation_id player active n`i': gen dupl_n`i' = cond(missing(n`i'), ., cond(_N==1,0,_n))
        egen temp = rowtotal(dupl_n*) if `i'==14
        replace dupl_cell = temp
        drop temp
        local i = `i' + 2
    }
    gen test = cond(missing(dupl_cell),.,cond(rank!=maxrank,1,0)) if dupl_cell!=0
    drop if test==1
    replace dupl_prod = 0 if dupl_prod!=0 & dupl_cell==0 & rank!=maxrank & active==1
    drop test rank maxrank active dupl_cell
    drop dupl_n*
}
The code runs without error but does not produce the desired result: it
retains only the initial first-best product. Moreover, it produces this
same result even when the number of repetitions is set to 2, although
reaching it should take more than two iterations of the "forval" loop.
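For reference, steps 1 to 4 above can be read as a greedy selection: go through each player's "problematic" products from highest to lowest product_value and keep a product only if it shares no component with a product already kept for that player. The following is a minimal sketch of that reading, not a tested fix of the code above. It rests on several assumptions: components are compared column by column (n2 with n2, n4 with n4, and so on), products with dupl_prod == 0 are always kept, ties in product_value are broken by prodsrt, and only n2 to n8 are present as in the example. The variables kept, pending, done, negval, cand, clash and anyclash are hypothetical helpers.

```stata
clear
input simulation_id str8 prodsrt player dupl_prod n8 n6 n4 n2 product_value
1 "04091520" 1 0 4  9 15 20 3498
1 "02081821" 1 0 2  8 18 21 2457
1 "06101424" 2 0 6 10 14 24 3686
1 "03071719" 2 9 3  7 17 19 2025
1 "05111323" 2 7 5 11 13 23 2509
1 "03121619" 2 2 3 12 16 19 2544
1 "01111319" 2 4 1 11 13 19 2791
1 "05071723" 2 5 5  7 17 23 2509
end

* Assumption: only these component columns exist; extend the list if needed
local comps n2 n4 n6 n8

gen byte kept    = (dupl_prod == 0)   // clean products are kept from the start
gen byte pending = (dupl_prod != 0)   // "problematic" products still to examine
gen byte done    = !pending           // sort key: pending products come first
gen double negval = -product_value    // sort key: highest value comes first
gen byte cand  = 0
gen byte clash = 0

count if pending
local todo = r(N)
while `todo' > 0 {
    * Candidate: the highest-valued pending product of each player
    bysort simulation_id player (done negval prodsrt): replace cand = (_n == 1 & pending)

    * Flag kept products that share a component with the candidate,
    * which sits in the first row of its by-group
    replace clash = 0
    foreach v of local comps {
        by simulation_id player: replace clash = 1 if kept & cand[1] & `v' == `v'[1] & !missing(`v')
    }
    by simulation_id player: egen byte anyclash = max(clash)

    * Drop a clashing candidate; otherwise move it to the kept set
    drop if cand & anyclash
    replace kept    = 1 if cand
    replace pending = 0 if cand
    replace done    = 1 if cand
    drop anyclash

    count if pending
    local todo = r(N)
}

drop kept pending done negval cand clash
sort simulation_id player prodsrt
list, noobs
```

Each pass resolves one candidate per player, so the loop stops on its own after as many passes as the largest number of "problematic" products held by any player, with no need to fix the number of repetitions in advance. On the example data this should leave the five products shown in the desired outcome table, although in a different row order.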