-
Notifications
You must be signed in to change notification settings - Fork 77
Compare DataFrames #1556
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Compare DataFrames #1556
Conversation
…es, there is no difference in the logic
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
| // row at index 'x-1' of dfA was removed | ||
| xPrev + 1 == x && yPrev + 1 != y -> { | ||
| comparisonDf = comparisonDf.concat( | ||
| dataFrameOf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strange formatting, can you lint this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also probably best to add argument names to ComparisonDescription, "magic" argumentless constants are often a source of bugs :)
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
| internal class ComparisonDescription( | ||
| val rowAtIndex: Int, | ||
| val of: String, | ||
| val wasRemoved: Boolean?, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this results in true or null... what's wrong with false? Same below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so... rows can either be removed or inserted, right? nothing else. Let's turn these two booleans in an enum as well then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I choose for null instead of false to remark that if wasInserted is true, wasRemoved column does not even make sense (and vive-versa).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see; well, in that case an enum makes more sense I think
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/compareDataFrames.kt
Outdated
Show resolved
Hide resolved
| val of: String, | ||
| val wasRemoved: Boolean?, | ||
| val wasInserted: Boolean?, | ||
| val afterRow: Int?, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe call this insertedAfterRow or something more expressive. That explains why it's null if wasRemoved == true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, but honestly, this seems a thing AI would do wrong. If you're using AI, that's okay; however, you remain the one responsible for the code.
At first you had:
val wasRemoved: Boolean?
val wasInserted: Boolean?
val afterRow: Int?I remarked here: #1556 (comment) both values had two options: either null or true. This is an odd thing to do with booleans which, as you know, have 2 values already: true or false, however, you explained why, so that makes slightly more sense now :)
What I don't understand is why you now have two nullable RowOfComparisons:
val wasRemoved: RowOfComparison?
val wasInserted: RowOfComparison?What would these even contain? A row is either 'removed' (WAS_REMOVED) or 'inserted' (WAS_INSERTED_AFTER_ROW), right? So why not just make one column called modification: RowOfComparison?
Next, I meant you to rename afterRow to insertedAfterRow, not wasInserted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, no AI at all was used in this code. I misunderstood #1556 (comment),
I thought that I had to create an enum whose constants had an exact correspondance with boolean values true and false (so that the schema could remain the same) .
However, now I understood what You meant :), I proceed to correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then I take back what I said :) Thanks for the explanation!
| val wasRemoved: Boolean?, | ||
| val wasInserted: Boolean?, | ||
| val afterRow: Int?, | ||
| ) : DataRowSchema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could include the modified DataRow<*> in the resulting DataFrame as well. That could make it a bit easier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Myers difference algorithm exploits the idea of comparison in a boolean sense,
a member (row) of the data structure (df) is equal or not-equal to another member.
In this logic a modified row is represented by two DataRow<ComparisonDescription> ,
One row is like: row n of dfA was removed and the other: row m of dfB was inserted after row n-1 (of dfA).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imo representing explicitly modified rows means customing Myers Alg logic by introducing a 3-possible-output non boolean logic of comparison: Equal, Non Equal, Similar. Similar means that compared row differ for a limited number of elements (that may be proportional to row's length).
In a lower abstraction level, 'similar' looks like a 'flagged' diagonal move in the edit script graph
-> list of Pair would be no more enough to represent the result, we would need a list of 3-ple.
Anyway, imo, this slution has the following weakness: the result is less neutral because the previous definition of 'Similarity' can't determine with certainty wheter a row was actually modified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not (yet) asking for in-row differences. Simply for adding the original row to a new column when that row was "Not Equal", so either "removed" or "inserted". A ComparisonDescription like row n of dfA was removed doesn't really help me if I can't see what "row n of dfA" actually contains. Similar to row m of dfB.
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it would make the comparison output more independent. I can try to implement it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it would make the comparison output more independent. I can try to implement it.
Done, i added a DataRow<T> column to ComparisonDescription<T> representing the content of the row.
FIXES #658
It makes possible to compare DataFrame by exploiting Myers difference algotithm whose cost is O((M+N)*D) .
M is length of dfA, N is length of dfB, D is length of shortest edit script to get B from A.
Returns a DataFrame< ComparisonDescription >,
ComparisonDescription is a schema created specifically for this use case.
It comes with a proper test case.
About Myers difference algotithm:
https://neil.fraser.name/writing/diff/myers.pdf